AI Dev 26 Field Card

28-29 Apr

At registration (ask first)

  1. Are sessions being recorded today? (AI Dev 25 sessions did appear on the DeepLearning.AI YouTube channel afterward, so the answer is likely yes, but worth confirming for this event.)
  2. When will recordings go live for paid attendees?
  3. Is there an attendee-only feed, or is the only access via the public YouTube drop?
  4. Are speaker slides distributed separately, and where?
  5. Is there an attendee directory or contact-exchange tool you can opt into for follow-up with speakers and panelists?
If recordings exist on a reasonable turnaround, you can be more selective about which talks need verbatim live capture and lean harder on the room (Q&A, hallway track, booths) for the ones that will be on tape later.

Intro line (deliver cold)

Default
"I'm a Cambridge PhD researcher here for fieldwork. I spent 25 years on developer tools at Microsoft and Google (Visual Studio, Android Studio, Ads Platform), and the puzzle I'm chasing now is what changes about how engineering teams evaluate their own work as AI moves deeper into the loop. Curious what's been surprising for your team."
Vendor booth (swap last sentence)
"Curious what your customers are running into that you didn't expect."
Recording consent
"Mind if I record this for my research notes? I'm a Cambridge PhD studying how teams evaluate AI systems."
Around 50 words, 3 breaths, 20 seconds. The Microsoft and Google line does more work than Cambridge at a developer conference, so lead with industry. "Puzzle" not "framework." Stay mechanism-quiet on the topic itself, which is what makes practitioners lean in.

Day 1 Targets (Tue 28 Apr)

Morning (confirmed times)
9:10 – 9:25
Anush Elangovan (AMD, VP Software)
Impact of AI on Software
Treat as foil capture. Get verbatim on "software becomes tokens, advantage shifts to execution velocity" and "writing code to steering intent." This is velocity stacking with no correctness variable in the frame, a cleaner Yegge-shape than the Yegge exhibit and one level deeper in the stack at the kernel layer.
9:25 – 9:40
Marc Brooker (AWS)
Keynote: The Sorcerer's Apprentice Problem (Why Agent Safety Lives Outside the Agent)
He is PSF-adjacent rather than foil material. He concedes that alignment-from-inside is losing and displaces the problem to architecture. Listen for who writes the Cedar policies and on what evidence, since the "dumber box" solution presupposes the very evaluative capacity PSF says is eroding. Quote-of-the-day candidate: "mathematically verified, not probabilistically hoped for." Tag for the collaborator file next to Celia Moore.
9:40 – 10:35
Panel: Future of Software Engineering
Catasta (Replit) · Maloney (LandingAI) · Alake (Oracle) · Reis (Practical Data Media) · Mogilko (mod.)
This is the quote-mining hour. Catasta and Maloney will run on the augmentation rail. Watch Reis specifically, since he is the closest thing to a contrarian on stage given his recent posts. Track whether the panel metabolizes Brooker's architectural framing or proceeds on Elangovan's velocity rail. If the panel ignores Brooker, the displacement is itself data.
10:35 – 10:50
Emma McGrattan (Actian)
Engineering the Context Layer (Vector Databases Across Cloud, Edge, On-Prem)
Lower PSF yield than the rest of the morning. Use this slot as a buffer to write up panel notes while the room is fresh, or skim for the deployment-topology framing if you want it on tape.
10:50 – 11:05
Keynote (Google), speaker unspecified
[Title TBA, Google product or research framing expected]
Correction from earlier: Paige Bailey is now confirmed at 4:15 PM, so this 10:50 slot is a different Google speaker, likely a Google Cloud or product team voice given the slot pattern. Listen for product-rollout language, capability claims validated by demos, and unstated counterfactuals in the productivity narrative.
11:05 – 11:25
Andrew Ng (DeepLearning.AI)
Keynote: The future of software development
The structural high-water mark of the morning's augmentation rail. Listen for productivity claims, whether internalization is addressed, and what counterfactual the gains are measured against. Pair with Yegge and Elangovan in writeup. Brooker at 9:25 is the only structural pushback in the morning, so the live question is whether Ng acknowledges architectural constraints or proceeds as if Brooker did not speak.
11:30 – 12:00
Marc Manara (OpenAI)
Fireside chat
Conversational format means more unguarded rhetoric than a keynote, so this is high foil-capture potential. Listen for product-capability claims tied to specific announcements, internal adoption metrics if cited (parallel to your Anthropic self-report anchor), "AI engineer" identity language, and shipping or velocity framing. The interviewer's steering matters as much as Manara's answers.
Afternoon (confirmed times) · Stage 2 is the camp room 1:00–3:15
1:00 – 1:40 (conflict)
Harrison Chase (LangChain, Stage 2, default pick)
The Observability Flywheel: From Traces to Continuously Improving Agents
This is the talk you came for. Continuously improving against which evaluator? The flywheel metaphor accumulates momentum through proxies that get institutionalized as quality. Chase is the most influential observability voice in the field, and what he says here shapes how thousands of teams describe their own dashboards. Capture this verbatim. Conflict: Nyah Macklin (Neo4j, Stage 3) is opposite Chase with "'The AI Said So?' How to Build Auditable AI Agents Using Context Graphs." The title points more directly at the PSF mechanism, and the audit framing is meatier than the typical session blurb. Default is Chase. If registration confirms recordings on a fast turnaround, flip to Macklin live and watch Chase on tape, since auditability framing is harder to recover from a recording without the room.
1:45 – 2:25
Anupam Datta (Snowflake, Stage 2)
Optimize Your Agent's GPA with Coding Agents
This is on-the-nose PSF territory. The title is literally about using coding agents to optimize a quality metric (GPA), which is recursive proxy optimization. Datta is more substantive than the title suggests, since he was formerly at CMU and founded TruEra (now part of Snowflake) with serious work on AI accountability. He may turn out to be PSF-adjacent rather than foil material, so tag accordingly and consider as a potential collaborator candidate.
2:30 – 2:50
Jean-Marie John-Mathews (Giskard, Stage 2)
Red Teaming LLM Applications: Systematically Finding Failures in Agents, RAG, and Chatbots
Red teaming is the externalization of evaluation, often a substitute for direct judgment when internal evaluative capacity has eroded. Listen for who decides what counts as a failure and on what evidence. The slot is short at 20 minutes, so the cost of attending is low.
2:55 – 3:15
Pratik Verma (Okahu AI, Stage 2, optional)
Observability Agent to Find & Fix Issues in AI Agents
This is recursive observability: an agent watching agents, the proxy-watching-proxy pattern in pure form. If you stayed on Stage 2 from 1:00, attend. If you need a break before coffee, skip the talk and use the time to write up.
3:15 – 3:30
Coffee break
Hallway track
Catch Chase, Datta, or John-Mathews near coffee if any of them stayed. Brooker too if he is still around. Have cards ready and be in follow-up mode.
3:30 – 4:10
Buffer slot, or Melissa Herrera (Temporal, Stage 3)
Your Agents Should Be Durable
Lower PSF yield, but "durability" is reliability rhetoric and worth a sample if energy permits. Otherwise use the slot to consolidate notes from the heavy 1:00–3:15 block before Bailey at 4:15.
4:15 – 4:55
Paige Bailey (Google DeepMind, Stage 1)
What's New and What's Next in AI
Listen for the stable-evaluator assumption, how capability claims get validated on stage, and alignment with or divergence from the Sziebert "18-Month Wall" frame. Capability talks are where unstated counterfactuals are most visible. Cross-reference with whatever the 10:50 Google speaker said in the morning.

Trap categories to listen for

A · Inflation
  • Significance framing the evidence cannot bear
  • False comparative ranges (X to Y productivity gain)
  • Rule-of-threes patterns in claims
  • Unearned negative parallelisms ("not just X, but Y")
B · Substitution
  • Volume as quality proxy (commits, PRs, tickets, lines)
  • Benchmark conflated with evaluation (Bean et al.)
  • Engagement metric standing in for outcome
  • Self-report taken as ground truth (METR perception gap)
C · Continuity
  • Augmentation / centaur framing (Brynjolfsson, Mollick)
  • Evolution / staircase framing of discontinuity (Yegge)
  • "Just a tool" framing
  • Stable-evaluator assumption (cross-cutting)
D · Concealment
  • Premature arrest (declaring done before judgment forms)
  • Differential burden absent from the account
  • "AI mindset" or posture talk without mechanism
  • Counterfactual not specified, only outcomes named
These four categories mirror the operational structure of the 21-trap repo. Exact trap labels live in the toolkit. Use the category letter when tagging in real time.

Live tag system

  • T-A/B/C/D · trap (with category letter)
  • F · foil candidate
  • E · evidence (corroborates PSF)
  • Q · direct quote (capture verbatim)
  • B · boundary activity reference
  • ? · follow up later
Pattern in notes: TIME · SPEAKER · TAG · 6–10 words. Reconstruct in the logbook tonight, not live.
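For the nightly logbook pass, a minimal Python sketch of the note pattern above; the function name, record shape, and example line are illustrative assumptions, not part of the card:

  CATEGORIES = {"A": "Inflation", "B": "Substitution",
                "C": "Continuity", "D": "Concealment"}
  TAGS = {"F", "E", "Q", "B", "?"}  # standalone tags from the live tag system

  def parse_note(line):
      # One live note: TIME · SPEAKER · TAG · 6-10 words of gist.
      time, speaker, tag, gist = (p.strip() for p in line.split("·", 3))
      record = {"time": time, "speaker": speaker, "tag": tag, "gist": gist}
      if tag.startswith("T-"):
          # Trap tags carry a category letter, e.g. T-B for a substitution trap.
          record["category"] = CATEGORIES.get(tag[2:], "unknown")
      elif tag not in TAGS:
          record["flag"] = "unrecognized tag"  # surface for manual review tonight
      return record

  # parse_note("9:12 · Elangovan · T-C · velocity framing, no correctness variable")
  # -> {'time': '9:12', 'speaker': 'Elangovan', 'tag': 'T-C',
  #     'gist': 'velocity framing, no correctness variable', 'category': 'Continuity'}

Note that B does double duty in the system as written: a category letter inside a trap tag (T-B) and a standalone boundary-activity tag. The T- prefix is what keeps the two distinguishable in notes.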

Foil-recognition prompts

  1. Is the speaker assuming the evaluator stays the same across the transformation?
  2. Is engagement framed as merely informative rather than constitutive of the metric?
  3. Is the gap framed as measurement timing rather than capacity erosion?
  4. Is the productivity claim measured against any specified counterfactual?
  5. Whose judgment validated this metric? On what evidence?
  6. Does the framing presuppose evaluative capacity is preserved?
  7. Is engagement substituted for outcome, or kept distinct?
  8. What is being lost in the account, even by implication?

Booth questions by strand

Agent platforms / dev tools (LangChain, Replit, LandingAI)
  • How do your customers know when the agent is wrong about something they could not have caught themselves?
  • Have you seen teams whose ability to evaluate output got better, or worse, over time on your platform?
  • What is the best diagnostic you have seen for whether a team is internalizing the agent's work or just shipping it?
  • When something goes wrong in production, where does the audit start, and who can read the trace?
Enterprise infra / data (Oracle, AMD, Actian, Neo4j, Arm)
  • When a customer reports the system is performing well, how do they know that?
  • Have you encountered customers who could no longer distinguish a working system from a failing one?
  • What does evaluation look like for the people who used to do this work manually?
  • Which proxy do your customers most often mistake for outcome?
Startup track
  • What does success look like that is not measurable in your first 18 months?
  • Which of your current metrics would you sacrifice if you could keep the rest?
  • Which boundary activities (developer relations, customer success, design review) became harder, not easier?
  • What is your customers' most confident wrong belief about your product?
Ask consent before recording. Field-note shorthand right after each booth, not later. One vendor at a time, no double-booth without a reset walk.

Anchor studies (one-line cite)

METR · perception gap, 39pp
Cruces et al. 2026 NBER 34851 · scaffolded, not internalized
Bean et al. 2026 Nature RCT · benchmark proxy failed
Liu et al. 2026 arXiv 2604.04721 · 3 RCTs, N=1,222, ~10 min of AI exposure reduced persistence and independent performance, causal
Kim & Kang 2025 OS · predictions degrade reasoning
Acemoglu et al. 2026 NBER · knowledge collapse
Koren et al. 2026 · Tailwind: downloads up, revenue −80%
Anthropic, OpenAI, Uplevel, DORA · supporting self-report and engagement-metric evidence