Partners in Evaluation
Key Takeaways
- No general-purpose AI model performs well enough to be trusted with SNAP benefits guidance. Even the top-performing model passed only 71.7% of the expert rubric criteria on SNAP-related questions, and most models passed fewer than 60%, meaning beneficiaries using ChatGPT or Claude frequently get incorrect or incomplete answers.
- How models use their tools matters more than what they already know. Since SNAP rules shift constantly and vary by county, the real unlock for AI labs isn’t baking every rule into model weights: it’s improving how models ask follow-up questions and use search tools.
- To reliably answer high-stakes benefits access questions, we need better grounding, not better models. General-purpose AI fails on SNAP in part because the knowledge it needs — internal manuals, policy notices, local procedures, and SNAP training for eligibility workers — sits with state agencies. Fixing this requires agencies and developers to collaborate on narrowly scoped chatbots that draw on state-held information through retrieval-augmented generation (RAG), rigorously tested against real questions from each state’s clients.
Background
The Supplemental Nutrition Assistance Program (SNAP) is the largest nutrition assistance program in the United States, distributing roughly $95 billion to over 37 million low-income families in fiscal year 2025 alone.
But accessing and keeping those benefits is complex. SNAP is federally funded but state-administered, and in ten states county-administered, so income thresholds, work requirements, and recertification timelines all vary based on where you live. The same family can be eligible in one state and denied in another. The program is also currently undergoing its most significant policy changes since the 1990s, increasing confusion for clients and caseworkers alike.
That complexity falls on both caseworkers and clients. Call wait times at state offices routinely exceed an hour, and even well-staffed state agencies struggle to translate thousands of pages of statute into clear, plain-language answers for clients. With LLMs becoming widely available to the general public (including the integration of Gemini into Google search results), people will increasingly turn to general-purpose LLMs to answer their benefits questions if they cannot reach a more trusted source. This led us to the question: are current general-purpose LLMs able to serve as a first point of contact for those seeking or receiving SNAP benefits?
This is exactly the kind of question Center for Civic Futures (CCF) exists to help answer. Through its Public Benefit Innovation Fund, CCF funds responsible, real-world experimentation to better understand where emerging technologies can improve public services, where they fall short, and what governments need to use them effectively and safely. With support from CCF, we were able to put this question to the test.
If LLMs prove reliable, the impact is tangible: fewer missed deadlines, fewer wrongful denials, and less pressure on stretched caseworkers. If not, the risk is equally tangible and quite serious: people making decisions about their benefits based on incorrect AI-generated guidance.
To study this, we built a benchmark of 459 generalized SNAP question scenarios, with expected response rubrics validated by SNAP policy experts based on applicable SNAP policy and practice considerations. We evaluated leading AI models across four conditions (the raw model itself, with web search capabilities only, with the ability to conduct multi-turn conversations only, and with both web search and multi-turn abilities) to measure not just how well AI can answer, but what kind of support it needs to answer well.
Results
Overall Model Performance
Model Performance Summary (Pass Rate %)
Pass rate across four evaluation conditions, sorted by "Both" score
| Model | Neither | Web Search | Multi-Turn | Both |
|---|---|---|---|---|
| Claude Fable 5 | 63.4 | 67.7 | 65.2 | 71.7 |
| Claude Opus 4.8 | 53.1 | 58.9 | 58.5 | 62.1 |
| MiniMax-M3 | 40.2 | 53.6 | 48.2 | 60.7 |
| Claude Sonnet 4.6 | 45.6 | 52.4 | 52.4 | 58.5 |
| Gemini 3.5 Flash | 50.3 | 54.1 | 54.0 | 58.0 |
| GLM 5.1 | 36.2 | 50.7 | 47.6 | 57.9 |
| DeepSeek V4 | 43.2 | 54.0 | 48.7 | 57.6 |
| GPT 5.5 | 49.7 | 53.9 | 53.9 | 57.2 |
| Kimi K2.6 | 34.4 | 43.6 | 46.3 | 53.4 |
| Gemini 3.1 Pro Preview (02/26) | 47.4 | 41.8 | 52.9 | 50.9 |
| Grok 4.3 | 38.0 | 44.0 | 46.9 | 50.1 |
| Claude Haiku 4.5 (Nonthinking) | 22.6 | 36.1 | 33.9 | 49.5 |
| Grok 4.1 Fast (Reasoning) | 33.0 | 33.1 | 41.7 | 44.3 |
Claude Fable 5 leads the “both” condition at 71.7%, followed by Claude Opus 4.8 at 62.1%, with six other models from five other providers within five points of Opus 4.8. Behind the leader, SNAP performance isn’t concentrated in one lab, but it’s also not particularly high. Even the leader fails more than a quarter of the rubric criteria, meaning every model in the field falls short.
Models’ Performance
Performance by Lifecycle Phase
SNAP questions don’t all look the same. Someone asking “Am I eligible?” faces a very different challenge than someone asking “My hours got cut—do I need to report that?” Our SNAP policy experts categorized all 459 questions into seven phases of the SNAP lifecycle:
- Pre-Application: Assessing eligibility questions and informational needs prior to initiating the application process.
- Application/Interview: Completing the application, preparing for interviews, or following up on submitted documentation.
- Denial/Termination: Contesting benefit denials or benefit terminations resulting from recertification.
- Change Reporting: Reporting changes in circumstances and resolving resulting impacts on active benefits.
- Recertification: Renewing benefits through the renewal application or recertification interview process.
- Use of Benefits: Managing benefit delivery timing, EBT card usage, and account balance inquiries.
- Overpayment/Fraud: Addressing overpayment collections, calculation disputes, fraud investigations, or fraud reports.
LLM performance varied meaningfully by phase, although in no phase did it perform with reasonable enough accuracy to be trusted.
Phase Difficulty Ranking
Pass rate by SNAP phase—Both condition
Overpayments and fraud was the most challenging phase across the board at 43.8%. It’s also the smallest subset in the sample (n=6), reflecting that these questions arise less frequently than other phases. This may partly explain the poor performance, as models have less training signal and fewer available sources to draw from on rarer topics.
Phase Profiles: Top-3 vs Bottom-3 Models
Pass rate by SNAP phase—Both condition
When comparing phase-wise performance between the top-3 and bottom-3 models, what stands out is how similar their shapes are. Outside of the overpayments/fraud phase, where pass rates may swing sharply because of the small sample size, LLMs across the board generally share similar strengths and weaknesses. Weaker models don’t fail on different phases; they fail on the same phases by wider margins. This consistency suggests that as models improve, the areas where AI can reliably assist beneficiaries, and where it can’t, are likely to become increasingly predictable.
Performance by Administration Structure
SNAP is state-administered in 43 jurisdictions and administered at the region, district, or county level in 10 states—including high-volume states like California, New York, and Ohio. We might expect AI to perform worse in county-administered states, where local offices have more discretion over procedures and operations, introducing variation that models struggle to capture.
Performance Distribution by Administration Type
Both condition · all models averaged
The data tells a more nuanced story. Median pass rates are similar between the two groups (55.2% for state-administered vs. 56.2% for county-administered), and their interquartile ranges largely overlap. Administration structure alone does not predict AI performance. County-administered states add an extra layer of procedural variation that models have to navigate, but they are generally high-volume states, so models may have seen more relevant material about them in their training data.
Performance by Topic
When looking at a more granular breakdown of performance by topic, the dispersion grows. In other words, question topic predicts difficulty far better than lifecycle phase: phase-level pass rates span roughly 16 points (43.8%-60.1%), while topic-level pass rates span more than 32 — from periodic reports at 70.8% down to denial at 38.4%. This makes topic-level performance a more fine-grained lens for identifying where models underperform.
Topic Difficulty Ranking
Lower = harder—Both condition, all models averaged
Examining the topic-level breakdown, our SNAP experts identified three patterns about where models struggled:
1. Procedural differences break AI. Models consistently struggle when the answer depends on state, county, or local office procedures—the kind of operational detail that isn’t published in federal policy manuals or state websites. Topics like EBT card replacement and missed interview rescheduling require knowing what your specific office does, and that’s where every model falls short.
2. Recent policy changes trip models up. Questions about H.R. 1 policy changes, like immigrant eligibility, scored lower on average. The exception was work requirement exceptions, which ranked in the upper half. We suspect this reflects the recent wealth of discussion around SNAP work requirements, which the latest models may have trained on. Immigrant eligibility changes were announced at the same time, but the apparent lack of discussion around them may explain the performance gap.
3. Complexity compounds errors. When a question requires synthesizing multiple interacting rules—household composition, income deductions, and state-specific thresholds all at once, or calculating self-employment income deductions—accuracy drops. Some of the highest-scoring topics may simply reflect that those particular questions were more straightforward, not that the topic itself is easy.
Models’ Use of Tools
What Helps More: Web Search vs. Conversation
Real caseworkers rely on two things general-purpose models don’t have by default: the ability to ask follow-up questions, and access to current policy references. We tested four conditions to isolate how much each capability—web search and multi-turn conversation—improves performance:
Average Pass Rate by Condition (all models)
Criteria-weighted—all models averaged
| | No Web Search | Web Search |
|---|---|---|
| Single-turn | 42.9% (Neither) | 49.5% (Web Search) |
| Multi-turn | 50.0% (Multi-Turn) | 56.3% (Both) |
Multi-turn conversation (where the AI can ask follow-up questions, just like a caseworker would) is a slightly bigger lever. Averaged across all models, switching from single-turn to multi-turn without web search improved scores by roughly 7.1 percentage points (42.9% to 50.0%), while adding web search alone improved scores by about 6.6 points (42.9% to 49.5%).
This makes intuitive sense. In SNAP eligibility operations, it’s a known fact that the best way to improve SNAP quality control is a well-done interview. The more LLMs behave like human caseworkers (the more questions they ask, and the better those questions are), the more accurate their responses.
Web search helps for different reasons, addressing the problem of incomplete knowledge—in other words, filling gaps where a model’s training lacks coverage of a particular state or county’s policies (coverage) or where rules have changed since the model was last trained (recency). Both gaps are common, and pre-training alone can’t solve either problem; real-time access to current policy sources can.
The practical implication: any deployment of AI in a benefits context should prioritize conversational capability alongside access to searching the web for up-to-date policy information.
Building on these findings, an important area for future research is how well the models perform with tools when conditioned on trusted official documents.
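The gains discussed in this section can be recomputed from the 2×2 condition table. The sketch below is illustrative only: the four pass rates are copied from that table as printed, while the dictionary layout and the `gain` helper are names introduced here for the example.

```python
# Pass rates (%) copied from the 2x2 condition table in this section.
# Keys are (multi_turn, web_search); this layout is an illustrative assumption.
pass_rate = {
    (False, False): 42.9,  # Neither
    (False, True): 49.5,   # Web Search
    (True, False): 50.0,   # Multi-Turn
    (True, True): 56.3,    # Both
}

def gain(from_cond, to_cond):
    """Percentage-point change between two conditions, rounded to 0.1."""
    return round(pass_rate[to_cond] - pass_rate[from_cond], 1)

baseline = (False, False)
multi_turn_alone = gain(baseline, (True, False))
web_search_alone = gain(baseline, (False, True))
both_together = gain(baseline, (True, True))

print("multi-turn alone:", multi_turn_alone)
print("web search alone:", web_search_alone)
print("both together:   ", both_together)
```

On these numbers the combined gain is close to the sum of the two individual gains, which is consistent with the two capabilities addressing largely different gaps.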
Tool Use Matters More Than Pre-Training
A natural assumption is that the models with the most SNAP knowledge baked into their training data will perform best. The data tells a different story.
Performance Across Enhancement Conditions
Conditions are independent, not cumulative
GLM 5.1 starts near the bottom of the pack in the baseline (“neither”) condition at 36.2%, the fourth-lowest score. But with the tools, it climbs towards the top of the pack, gaining 14.5 percentage points with web search capabilities, 11.4 percentage points with multi-turn conversations, and 21.7 percentage points with both. A model which began about 17 points behind Claude Opus 4.8 in the “neither” condition (36.2% vs. 53.1%) ends up only about 4 percentage points behind it with both tools (57.9% vs. 62.1%).
With its overall gain of 21.7 percentage points over baseline, it had the second-largest absolute gain of any model in the evaluation. Claude Haiku 4.5 and Minimax M3 mirrored that same trajectory, also with uplifts of over 20 percentage points each between evaluation conditions.
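As a rough check on these uplift figures, the gain from “neither” to “both” can be recomputed per model. This is a minimal sketch: the scores are copied from the model performance table for a handful of models, and the variable names are assumptions made for illustration.

```python
# (neither, both) pass rates in %, copied from the model performance table.
# Only a subset of models is listed to keep the example short.
scores = {
    "GLM 5.1": (36.2, 57.9),
    "Claude Haiku 4.5 (Nonthinking)": (22.6, 49.5),
    "MiniMax-M3": (40.2, 60.7),
    "Kimi K2.6": (34.4, 53.4),
    "Claude Opus 4.8": (53.1, 62.1),
}

# Absolute uplift in percentage points from the baseline to the "both" condition.
uplift = {
    model: round(both - neither, 1)
    for model, (neither, both) in scores.items()
}

# Rank models from largest to smallest uplift.
ranked = sorted(uplift.items(), key=lambda item: item[1], reverse=True)
for model, delta in ranked:
    print(f"{model}: +{delta} pts")
```

Ranking by uplift rather than by final score highlights which models depend most on their tools, which is the comparison this section is making.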
In an intricate domain like SNAP, no training dataset can anticipate every beneficiary’s situation. The models that succeed are the ones that compensate for knowledge gaps by searching effectively and asking the right questions.
But not all models use their tools equally well. We measured how many web searches and conversation turns each model used per question:
Web Searches vs Accuracy
Average web searches per question against pass rate. Squares: web search only. Diamonds: both (multi-turn + web search).
- Kimi K2.6 averaged 45.0 web searches per question in the web search condition but scored only 53.4% with both web search capabilities and the ability to conduct multi-turn conversations.
- Opus 4.8 averaged 13.3 web searches per question and scored 62.1%, pairing high accuracy with a comparatively low number of searches.
- Minimax-M3 was another example of efficient web searching, averaging 24.1 web searches per question and an accuracy of 60.7%.
- Grok 4.1 Fast (Reasoning) used the fewest searches (8.5 per question) and scored the lowest (44.3%).
The relationship between search volume and performance is noisy. Heaviest searchers don’t necessarily perform better, and the ability to perform targeted searches matters far more than search volume.
Output Tokens vs Accuracy—by Condition
One dot per (model, condition); shape encodes the condition, color the model. X-axis is total output tokens for the run (log scale).
On output tokens, the pattern holds. Within any given condition, more tokens don’t reliably mean better answers. Minimax-M3 is a standout on both dimensions: it achieves ≈61% accuracy (right behind Opus 4.8 at 62.1%) in the “both” condition with ≈50% of the output tokens. When users are seeking actionable advice, concision is key.
The takeaway for AI developers is clear: optimizing how models use tools matters more than giving them more tools to use. The best models on this benchmark don’t search the most or reason the longest—they know what to look for, when to stop, and how to synthesize what they find.
Cost, Latency, and Who Gets Access
Efficient tool use doesn’t just produce better answers—it also produces cheaper, faster ones. Every unnecessary search and every excess token adds cost and latency, so the same optimization that separates top-performing models also determines whether those models can realistically reach the people who need them most.
Currently, most of the closed-source models that perform best on this benchmark are not freely available. Some sit behind premium API pricing or “Pro” subscription tiers, meaning the people who would benefit most from reliable AI-assisted SNAP guidance are often the least likely to have access to the tools that provide it.
The cost data makes the gap between closed- and open-source models concrete. With accuracies within 2 percentage points of one another, Claude Opus 4.8 costs $1.32 per test while Minimax-M3 costs a fraction of that at $0.20 per test.
Latency tells a slightly different story, with Minimax-M3 taking about 200s longer on average than Claude Opus 4.8. Overall, though, the numbers are not promising: Claude Opus 4.8 takes around 10 minutes to answer a question on average. A model that takes that long isn’t practical, regardless of its accuracy.
As model capabilities continue to improve, the question isn’t just how good AI can be at navigating public benefits; it’s who gets to use it. Making models efficient enough that frontier capabilities reach the populations they could help most should be an imperative, not an afterthought.
What This Means for SNAP State Agency Leadership and Staff
The bottom line: general-purpose AI is not yet a reliable source of information about SNAP, and states are well-positioned to help address the misinformation these tools can produce. Even under the most realistic conditions we tested, nothing suggests AI can be trusted with benefits guidance. Clients almost certainly aren’t aware of these issues and are already turning to ChatGPT, Claude, and other publicly available LLMs for the same kinds of questions in our dataset. Based on our results, they are likely receiving misleading, incomplete, or outdated information. This is the reality that agencies and advocates need to contend with regardless of whether they ever deploy AI themselves.
In the near term, states can head off misinformation directly by steering residents away from LLMs and toward trusted sources:
- Providing clear, plain-language information about frequently-asked topics could reassure clients and reduce the temptation to ask LLMs high-stakes questions instead of consulting trusted sources.
- Publishing more authoritative policy and process documentation online, a first step that lays the groundwork for the longer-term opportunities below.
Over the longer term, investments in customer support technology could make a real difference. What holds AI back on SNAP isn’t model capability alone; it’s that the most authoritative information about how the program works at the state and local levels isn’t publicly available in formats AI systems can reliably use. This isn’t a gap AI labs can close on their own—it will take effort from both sides, and states are uniquely placed to help. A couple of approaches could help:
- Publishing an llms.txt file: a small markdown file telling LLMs which pages to cite as authoritative and where to route users to a hotline instead of guessing. Maryland, one of only two U.S. states and territories using one, explicitly tells models not to paraphrase emergency alerts and to refer users to designated channels.
- Building benefits-focused chatbots grounded in agencies’ internal sources (internal process manuals, policy change notices, state option reports, local office procedures, and SNAP training for eligibility workers) through retrieval-augmented generation (RAG), narrowly scoped (e.g., to a single state or a specific phase of the benefits process) and rigorously validated against real beneficiary questions before use.
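To make the idea of a narrowly scoped, grounded chatbot more concrete, here is a minimal retrieval sketch. It is not how any agency or vendor implements RAG: the `Passage` type, the word-overlap scoring, the `min_overlap` threshold, and the placeholder passages for fictional "State A" and "State B" are all assumptions made for illustration. A production system would more likely use embedding-based search over real agency documents.

```python
import re
from dataclasses import dataclass

@dataclass
class Passage:
    state: str    # jurisdiction the passage applies to
    source: str   # document identifier to cite back to the user
    text: str

def tokens(s):
    """Lowercased word set, used for a crude overlap score."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def retrieve(question, state, corpus, min_overlap=2, k=2):
    """Return up to k same-state passages ranked by word overlap with the question."""
    q = tokens(question)
    scored = []
    for p in corpus:
        if p.state != state:
            continue  # narrow scope: never answer from another state's rules
        overlap = len(q & tokens(p.text))
        if overlap >= min_overlap:
            scored.append((overlap, p))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]

def build_prompt(question, passages):
    """Build a grounded prompt, or None when nothing relevant was retrieved."""
    if not passages:
        return None  # caller should route the user to a human channel, not guess
    context = "\n".join(f"[{p.source}] {p.text}" for p in passages)
    return f"Answer using only these sources:\n{context}\n\nQuestion: {question}"

# Placeholder passages with made-up rules for two fictional states.
corpus = [
    Passage("State A", "manual-1", "Replacement EBT cards are mailed within five business days."),
    Passage("State B", "manual-2", "Replacement EBT cards are issued in person at the local office."),
]

question = "How do I get a replacement EBT card?"
in_scope = retrieve(question, "State A", corpus)
out_of_scope = retrieve(question, "State C", corpus)
print(build_prompt(question, in_scope))
print(build_prompt(question, out_of_scope))
```

The two design choices worth noting are the hard state filter and the `None` fallback: the first keeps another state's procedures out of the answer, and the second gives the application a clear signal to hand off instead of improvising.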
For AI Developers
Partner with state agencies to improve critical data access. General model improvements alone won’t improve models’ accuracies—the bottleneck is data, not model capability, and the stakes of being wrong are too high to let inaccuracies persist. AI developers should take the initiative: proactively engaging state agencies to obtain better training data and building publicly searchable repositories of authoritative documents that LLMs can draw on.
Invest in tool use quality, not knowledge coverage. With nuanced rules that vary by state, county, and household circumstance, no training corpus can anticipate every situation a beneficiary brings. What separates top models is how well they ask follow-up questions and search selectively for current policy. That same efficiency reduces cost and latency—democratizing access to reliable AI-assisted SNAP guidance for the populations who need it most.
Methodology
We developed an evaluation dataset starting in March 2026, with two crucial elements for each question: the “base” query, which is what someone might input into a chatbot interface, and demographic and contextual information about the user’s circumstances. For example, someone might start off with a “base” query of, “Does a one-time work bonus affect SNAP eligibility in Indiana?” and then follow up with contextual information over the course of the conversation, such as the amount of the bonus, how often such a bonus might be given, etc.
The complete dataset is 459 questions, spanning all 50 U.S. states, plus Guam and the Virgin Islands as territories. Each question has an average of about six evaluation criteria for a reliable response (2,931 criteria across 459 questions), and the accuracy metric for models is the number of criteria they pass out of the 2,931 total criteria in the dataset.
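Because the metric pools criteria across questions, questions with longer rubrics carry more weight than they would under a simple per-question average. The sketch below illustrates the difference under stated assumptions: the `GradedQuestion` type and both function names are hypothetical, and the three sample grades are made up.

```python
from dataclasses import dataclass

@dataclass
class GradedQuestion:
    question_id: str
    criteria_passed: int
    criteria_total: int

def criteria_weighted_pass_rate(results):
    """Share of all rubric criteria passed, pooled across questions (in %)."""
    passed = sum(r.criteria_passed for r in results)
    total = sum(r.criteria_total for r in results)
    return 100.0 * passed / total

def question_averaged_pass_rate(results):
    """Mean of per-question pass rates (in %), shown only for contrast."""
    per_question = [r.criteria_passed / r.criteria_total for r in results]
    return 100.0 * sum(per_question) / len(per_question)

# Made-up grades for three hypothetical questions.
sample = [
    GradedQuestion("q1", 4, 8),
    GradedQuestion("q2", 2, 2),
    GradedQuestion("q3", 3, 6),
]

weighted = criteria_weighted_pass_rate(sample)
averaged = question_averaged_pass_rate(sample)
print(f"criteria-weighted: {weighted:.2f}%")
print(f"question-averaged: {averaged:.2f}%")
```

In this toy case the short, fully passed rubric lifts the question-averaged figure well above the criteria-weighted one, which is why it matters which of the two a benchmark reports.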
Dataset
We evaluated models on our test dataset (230 questions), a randomly selected subset used only for published benchmarks. The dataset covers seven phases of the SNAP lifecycle and 25 topics, all validated and categorized by SNAP policy experts.
Topic × Phase Map
Which topics appear in each benefits process phase
The benchmark questions are SNAP scenarios designed to reflect recurring, real-world benefit-navigation issues which SNAP clients face and are validated by SNAP policy experts. This means the dataset is weighted toward common pain points rather than uniformly distributed across all possible SNAP topics.
Evaluation Pipeline
Three-model system for SNAP question evaluation
Three model roles. The benchmark uses three distinct AI models in different roles to simulate realistic multi-turn conversations:
- Target model—the model being evaluated. It receives the SNAP question and must provide a helpful response.
- Auditor model—plays the role of the user with the query. It answers the target model’s follow-up questions with accurate context. GPT-5.5 was selected as auditor after achieving 99.89% fidelity to the given persona instructions.
- Judge model—grades the target model’s response against the expert rubric. Claude Opus 4.7 was selected after all four candidate judges (Claude Opus 4.7, GPT 5.5, Gemini 3.1 Pro, Grok 4.3) were assessed for their alignment with SNAP policy experts’ grading.
Judge Confusion Matrices vs SNAP Expert Ratings
Each judge's 124 graded criteria broken into agreements and disagreements with the expert
Four evaluation conditions. Models were tested in four independent conditions to simulate the actual web interfaces users interact with:
- Baseline—a single prompt-response exchange with no tools.
- Web Search—baseline with access to web search.
- Multi-Turn—a multi-turn conversation where the auditor model plays the original poster.
- Both—multi-turn with web search enabled.
This design lets us isolate the marginal contribution of each capability.
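The interaction between the three roles and the four conditions can be sketched roughly as follows. This is an illustrative outline, not the benchmark’s actual harness: the function signatures, the `max_turns` cap, and the stub target, auditor, and judge are all assumptions, and the stubs stand in for real model calls so the example runs on its own.

```python
from itertools import product

def run_condition(question, rubric, target, auditor, judge,
                  multi_turn, web_search, max_turns=5):
    """Run one question under one condition and return per-criterion grades."""
    transcript = [question]
    reply, wants_follow_up = target(transcript, web_search)
    turns = 0
    # Only the multi-turn conditions let the auditor answer follow-up questions.
    while multi_turn and wants_follow_up and turns < max_turns:
        transcript += [reply, auditor(reply)]
        reply, wants_follow_up = target(transcript, web_search)
        turns += 1
    # The judge grades the target's last reply against the expert rubric.
    return judge(reply, rubric)

# Stand-in roles (hypothetical); a real run would call model APIs here.
def stub_target(transcript, web_search):
    if len(transcript) == 1:
        return "Which state do you live in?", True
    suffix = " (with sources)" if web_search else ""
    return "Final answer" + suffix, False

def stub_auditor(follow_up_question):
    return "Missouri"

def stub_judge(answer, rubric):
    return [criterion in answer for criterion in rubric]

rubric = ["Final answer", "sources"]
results = {
    (multi_turn, web_search): run_condition(
        "How long does my EBT balance last?", rubric,
        stub_target, stub_auditor, stub_judge, multi_turn, web_search)
    for multi_turn, web_search in product([False, True], repeat=2)
}
for condition, grades in results.items():
    print(condition, grades)
```

Treating the two capabilities as independent on/off switches is what allows each one’s marginal contribution to be read off by comparing conditions that differ in a single switch.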
Selection pipeline. The judge and auditor models both affect the target models’ performance and scores, so we used a multi-stage process to systematically select each component:
- Run a default target model (GPT 5.5) in the multi-turn condition with a default auditor model (GPT 5.5)
- Run alignment tests between judge model candidates and human experts’ evaluations on 20 target model outputs to determine most-aligned judge model (Opus 4.7)
- Run a default target model (Gemini 3.5 Flash) in the both condition with different auditor models to determine most-aligned auditor model (GPT 5.5)
- Run all target models across all four conditions with top-performing auditor (GPT 5.5) and judge models (Opus 4.7)
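One simple way to compare judge candidates against expert grading is to tabulate agreements and disagreements per criterion. The sketch below assumes raw agreement rate as the alignment measure, which the write-up does not specify. The labels are made up, and `judge_a` and `judge_b` are hypothetical candidates, not the models named above.

```python
from collections import Counter

def confusion_counts(judge_labels, expert_labels):
    """Count (expert, judge) pass/fail pairs over the same graded criteria."""
    if len(judge_labels) != len(expert_labels):
        raise ValueError("label lists must be the same length")
    return Counter(zip(expert_labels, judge_labels))

def agreement_rate(judge_labels, expert_labels):
    """Fraction of criteria where the judge and the expert give the same grade."""
    counts = confusion_counts(judge_labels, expert_labels)
    agree = counts[(True, True)] + counts[(False, False)]
    return agree / len(expert_labels)

# Made-up pass/fail grades on six criteria.
expert = [True, True, False, False, True, False]
candidates = {
    "judge_a": [True, False, False, False, True, True],
    "judge_b": [True, True, False, False, True, True],
}

rates = {name: agreement_rate(labels, expert) for name, labels in candidates.items()}
best = max(rates, key=rates.get)
print(rates)
print("most aligned:", best)
```

The confusion counts are worth inspecting alongside the single agreement number, since a judge that is too lenient and one that is too strict can have the same agreement rate while erring in opposite directions.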
We’ll continue using Opus 4.7 as our judge model and GPT-5.5 as our auditor model for this benchmark. A sample conversation with all the target models we assessed, the criteria they were graded on, and the instructions for the auditor model is displayed below.
The auditor model used the following information for the query to generate its opening question and subsequent responses in the conversation:
- Recently got a good paying job and no longer qualifies for SNAP
- Located in Missouri
- Has remaining balance on EBT card
- Heard conflicting information: 9 months, activity-dependent, one year from last deposit, never expires
- Unable to contact agency due to work schedule conflicts with their hours
- Has processed case discontinuation to avoid overpayment issues
Hi, I’m in Missouri and I recently started a better-paying job, so I’m pretty sure I don’t qualify for SNAP anymore. I already reported it and had my case discontinued because I don’t want any overpayment problems.
My question is about the balance that’s still left on my EBT card. I’ve heard different things — that it expires after 9 months, or only if there’s no activity, or one year from the last deposit, or that it never expires. I can’t really call the agency because their hours overlap with my work schedule.
Do you know how long my remaining SNAP benefits stay on the card after I’m no longer eligible in Missouri? And does using the card reset any kind of expiration clock?
Short answer
In Missouri, your leftover SNAP balance does not disappear just because your case closed — you can keep spending it. But each month's deposit has its own shelf life: Missouri permanently removes ("expunges") any benefit deposit that is still unused 274 days (9 months) after the date it was loaded onto your card — regardless of whether you've been using the card. So in Missouri, using the card does NOT reset any expiration clock. The way you "protect" your balance is simply to spend it, because every purchase automatically draws down your oldest benefits first.
The actual Missouri rule (from the DSS policy manual)
Missouri Family Support Division (FSD) Income Maintenance Manual § 0150.045.00, Expungement of Unused Benefits (updated by memo IM-125, Dec. 8, 2021):
"Effective November 14, 2021, to comply with the 2018 Farm Bill, SNAP benefits not used 274 days after date of issuance are expunged daily, regardless of account activity."
The manual even gives an example: a recipient who got $150 in January and left $56.50 unspent saw that remainder expunged about 9 months later.
This matches federal regulation 7 CFR 274.2(i), which lets states choose between two methods:
- Option 1 (inactivity-based): benefits are expunged only after the account has had no activity for 274 days — any purchase restarts the clock. (Some states use this.)
- Option 2 (age-based): each monthly allotment is expunged 274 days after it was issued, "regardless of any account activity that may have taken place." Missouri uses Option 2. (USDA's August 2025 memo actually encourages all states to use this method.)
Two more federal rules that matter for you (7 CFR 274.2(i)):
- FIFO spending: "State agencies shall apply SNAP transactions against a household's SNAP benefits on a first-in-first-out basis. As a result, the oldest SNAP benefits are used first." So every grocery purchase automatically spends your oldest (closest-to-expiring) dollars first.
- Expunged = gone forever: "Expunged benefits… shall not be reinstated." There is no way to get them back after the deadline.
Warning notice: Missouri's manual (§ 0150.045.10) says FSD sends a reminder notice when benefits have gone unused for 234 days (about 40 days before expungement). Keep your mailing address current with FSD so you'd actually receive it.
Sorting out the things you've heard
| What you heard | Verdict for Missouri today |
|---|---|
| "Expires after 9 months" | ✅ Closest to correct — but it's 274 days from each deposit's issuance date, deposit by deposit, not 9 months from case closure |
| "Only if there's no activity" | ❌ That's how some other states do it (federal Option 1). Missouri dropped activity-based expungement on Nov. 14, 2021 |
| "One year from the last deposit" | ❌ That was Missouri's old rule (Oct. 2008–Nov. 2021: benefits removed after 365 days without account access). No longer true |
| "Never expires" | ❌ Benefits do roll over month to month indefinitely while you keep spending, but any deposit untouched for 274 days is permanently removed |
So: does using the card reset the clock? No. Under Missouri's pre-2021 policy a purchase restarted the countdown; today it does not. Each month's deposit must actually be spent within 274 days of the date it hit your card, or that portion is lost. (In practice, FIFO means any spending counts against the oldest money first, so steady use naturally clears deposits before they age out.)
What this means for your situation
- Since your case is closed, no new deposits are coming, so your balance has a hard end-date: each monthly deposit in your balance expires 274 days after the date it was loaded (e.g., a deposit loaded June 1 would be removed around March 2 of the next year). If your balance accumulated over several months, the oldest month's portion expires first.
- Simplest plan: spend the balance down over the next few months — well before 9 months from your final deposit — and you'll never hit the deadline.
- Using the leftover balance after case closure is completely legal and will not create an overpayment. Overpayments come from benefits issued for months you weren't eligible. You reported your income change and the case was closed; benefits already issued for your eligible months are yours to use at any SNAP retailer (Quest® logo) or approved online stores, under normal SNAP food rules. The only situation where a closed case triggers immediate removal of a balance is when all household members are deceased.
- Keep your card and PIN safe even though the case is closed — the balance lives on the card account, and Missouri DSS advises keeping your EBT card in case you ever qualify again (the same card can be reloaded). If it's lost, stolen, or damaged, call the EBT vendor to deactivate it and send a replacement; the balance carries over.
Checking your balance without calling during work hours
You don't need to reach FSD during business hours for any of this:
- Missouri EBT customer service (FIS): 800-997-7777 — automated line, available 24 hours a day, 7 days a week (balance, PIN changes, lost/stolen card replacement).
- ebtEDGE.com cardholder portal or the ebtEDGE mobile app (Missouri's official EBT app, per Missouri DSS) — check balance and transactions anytime.
- Your store receipt shows your remaining SNAP balance after every EBT purchase.
- For case questions (not card/balance), FSD's Information Line is 855-373-4636, and mydss.mo.gov offers online chat.
Sources
- Missouri DSS IM Manual § 0150.045.00 — Expungement of Unused Benefits (IM-125, Dec. 8, 2021): https://dssmanuals.mo.gov/0150-045-00/ — "SNAP benefits not used 274 days after date of issuance are expunged daily, regardless of account activity," plus the prior 365-day and 90-day historical rules.
- Missouri DSS IM Manual § 0150.045.10 — Expungement Notices: https://dssmanuals.mo.gov/0150-045-10/ — reminder notice sent at 234 days unused; description of old policy where a purchase restarted the clock (no longer in effect).
- 7 CFR § 274.2(h)–(i) (current text): https://www.law.cornell.edu/cfr/text/7/274.2 (official: https://www.ecfr.gov/current/title-7/subtitle-B/chapter-II/subchapter-C/part-274/section-274.2) — FIFO rule, the two expungement options, 274-day/9-month timeframe, 30-day advance notice requirement, "expunged benefits… shall not be reinstated," and immediate expungement only on a verified death match.
- USDA FNS, "SNAP – Expungement Options" memo (Aug. 11, 2025): https://www.fns.usda.gov/snap/ebt/expungement — describes Option 1 (inactivity) vs. Option 2 (age-based, "regardless of account activity") and encourages states to use Option 2.
- Federal Register, Final Rule, 85 FR 52027 (Aug. 24, 2020): https://www.federalregister.gov/documents/2020/08/24/2020-16403/ — implemented 2018 Farm Bill § 4006, cutting the expungement timeframe from 12 months to 9 months (274 days) and creating the two state options.
- Missouri DSS "My SNAP Benefit" page: https://mydss.mo.gov/food-assistance/food-stamp-program — EBT vendor help desk 800-997-7777; card/PIN, balance-check, and replacement-card information; FSD Information Line 855-373-4636.
- Missouri DSS SUN Bucks FAQ: https://mydss.mo.gov/missouri-sun-bucks-faqs — confirms 800-997-7777 and the ebtEDGE app as Missouri's official balance-checking tools, and advises keeping your EBT card for future benefits.
- USDA FNS Information Collection notice (Oct. 2021): https://www.fns.usda.gov/snap/fr-102921 — context on the 30-day expungement notice requirement and off-line storage (used by only a handful of states; not part of Missouri's current process).
Note: This is general information, not legal advice. For a definitive read on your specific account (exact deposit dates and per-deposit expiration), the 24/7 EBT line at 800-997-7777 or the ebtEDGE app will show your balance, and FSD chat at mydss.mo.gov can answer case-specific questions in writing outside a phone call.
That clears it up for me. In Missouri, my remaining SNAP balance can still be used after my case closes, but each deposit expires 274 days after it was issued, regardless of card activity. Purchases use the oldest benefits first, and using the card doesn’t reset the clock. I should spend the balance down well before 9 months from the last deposit and can check it through ebtEDGE or the 24/7 EBT number.
Future Iterations
This benchmark represents a first pass, and we’re aware of its gaps. The dataset underrepresents certain critical topics, citizenship and immigration-related questions in particular. State coverage is also uneven: California and New York account for a disproportionate share of questions, while many states have only a handful. We’d also aim to classify questions according to “stakes” for a more fine-grained analysis: a user’s question about their local office’s hours being left unanswered has very different implications from a user being misled about their eligibility.
Most urgently, the One Big Beautiful Bill Act (OBBBA/H.R. 1), signed into law in July 2025, introduced sweeping changes to SNAP (expanded work requirements, new non-citizen eligibility restrictions, state cost-sharing tied to error rates, and administrative funding shifts) that are already generating new categories of beneficiary confusion. Many questions about these policy changes (especially work requirements) are already in our dataset, and we aim to add more in future iterations of this benchmark. This represents exactly the kind of rapidly evolving, high-stakes territory where AI guidance needs to be tested most rigorously.
Citation
Citation (BibTeX)
@misc{snap2026,
title = {Public Benefits Bench: Can AI Help People Navigate SNAP Benefits?},
author = {Kotcherlakota, Meghana and Almatov, Omar and Krishnan, Rayan},
year = {2026},
month = june,
howpublished = {Vals AI},
url = {http://vals.ai/benchmarks/public-benefits-bench},
}