Analytical sample = 419 · 80 images · 10 per person

Results

Three regression-based results from our ordered probit models, followed by three exploratory findings.

Mean score: 7.0/10
Participants: 419
Images tested: 80
Result 01 — Hypothesis 1

How device type affects score

Laptop: 7.2/10
Mobile phone: 6.6/10

In the full sample, a t-test shows a statistically significant difference in performance by device: laptop users scored higher on average (7.16) than mobile users (6.63; p = 0.0008). Subgroup t-tests for juniors and first-years were not statistically significant, likely because of their smaller sample sizes: across the three t-tests, p-values fell noticeably as sample size grew.
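As a rough illustration of this kind of comparison, the sketch below computes a two-sample t statistic from two lists of scores using only the Python standard library. The scores are made-up placeholders, not study data, and the function name `welch_t` is hypothetical. Using Welch's unequal-variance form is also an assumption, since the write-up does not say which t-test variant was run.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Squared standard errors of each group mean (sample variance / n).
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Placeholder scores out of 10 (NOT the study's data).
laptop = [8, 7, 9, 6, 8, 7, 7, 8]
mobile = [6, 7, 5, 8, 6, 7, 6, 6]

t_stat, dof = welch_t(laptop, mobile)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

Turning the statistic into a p-value requires the t distribution's CDF, which the standard library does not provide; if SciPy is available, `scipy.stats.ttest_ind(laptop, mobile, equal_var=False)` returns both in one call.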

In the student-only ordered probit model, device type is statistically significant, with a negative coefficient for mobile phone use (coefficient = −0.29, p = 0.007). The marginal effects are negative for higher scores (7–10) and positive for lower scores (2–6), indicating that using a phone increases the probability of lower scores and decreases the probability of higher scores. Device type does not reach significance in the full-sample or faculty-only models.
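To make the sign pattern of the marginal effects concrete, here is a minimal sketch of how an ordered probit turns a coefficient into category probabilities. Only the mobile coefficient (−0.29) comes from the results above. The cutpoints are hypothetical values picked for illustration, the score range 2 to 10 is inferred from the ranges mentioned in the text, and `category_probs` is a made-up helper name. Where the sign flips depends on the cutpoints, so the flip between 6 and 7 here is by construction.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF

def category_probs(xb, cutpoints):
    """P(score = k) = Phi(c_k - xb) - Phi(c_{k-1} - xb), with c_0 = -inf, c_K = +inf."""
    cdf = [0.0] + [PHI(c - xb) for c in cutpoints] + [1.0]
    return [hi - lo for lo, hi in zip(cdf, cdf[1:])]

scores = list(range(2, 11))  # score categories 2..10 (assumed range)
# Hypothetical cutpoints, for illustration only (not estimated from the study).
cutpoints = [-2.4, -1.9, -1.4, -0.9, -0.3, 0.3, 0.9, 1.6]
beta_mobile = -0.29  # mobile coefficient reported for the student-only model

laptop = category_probs(0.0, cutpoints)          # baseline: linear index 0
mobile = category_probs(beta_mobile, cutpoints)  # index shifted by the coefficient

for s, p_l, p_m in zip(scores, laptop, mobile):
    print(f"score {s:>2}: laptop {p_l:.3f}  mobile {p_m:.3f}  change {p_m - p_l:+.3f}")
```

A negative coefficient shifts the latent index down, so probability mass moves from the upper categories into the lower ones while the total stays at 1.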

Score distribution by device
Three t-tests on score, splitting by device type
Group              Laptop   Mobile   Δ      p
Juniors only       7.08     6.79     0.29   0.510
First-years only   7.53     6.95     0.59   0.229
Everyone           7.16     6.63     0.53   0.0008
Result 02

Age vs score

The best-fitting ordered probit model for the full sample includes gender, race, age, and device type. Being over 25 is statistically significant (alpha = 0.05) and associated with lower performance. Specifically, individuals over 25 are 3 to 5 percentage points less likely to score 8, 9, or 10, and 1 to 2 percentage points more likely to score between 4 and 6. Device type is not statistically significant in this specification.

The affiliation breakdown makes the pattern visible: faculty and staff averaged 6.43/10, the lowest of any group, while students across all four class years clustered between 6.92 and 7.19. That age still emerges as a predictor within a relatively homogeneous population like Bowdoin suggests that exposure to AI imagery during formative years may build informal visual instincts that older users have not developed in the same way.

Mean score by affiliation
Seniors: 7.19
First-years: 7.15
Sophomores: 6.96
Juniors: 6.92
Faculty / Staff: 6.43
Result 03

Effect of AI familiarity on detection

We measured AI familiarity four different ways: a self-rated familiarity scale, a literacy scale, hours per week of AI usage, and whether someone had taken an AI course at Bowdoin. We also tested a familiarity × usage interaction. Across all measures, familiarity did not predict accuracy in the full sample or the student-only sample.

In the faculty-only model, however, AI familiarity is positive and statistically significant (coefficient = 0.439, p = 0.036). Greater AI familiarity among faculty is associated with lower probabilities of scores in the 2–6 range and higher probabilities of scores in the 7–10 range. The effect does not appear in the full-sample or student-only models.

Mean score by AI familiarity level
Exploratory and descriptive findings
Exploratory 01

The effects of confidence on how you score

The relationship between confidence and accuracy is real but messier than you'd expect. Highly confident answers (9–10 on the confidence scale) were correct substantially more often than uncertain ones.

The takeaway: your gut is worth trusting when it's very strong, but moderate confidence is nearly meaningless as a signal. However, the hardest images, the ones that fooled the most people, were often accompanied by high confidence in the wrong direction.

There's also an asymmetry in how people erred. Participants were more consistent at correctly identifying AI-generated images as AI; true positive rates skewed toward the high end. True negative rates, correctly confirming that a real photograph is real, were more spread out and less reliable. When uncertain, people tended to call things AI rather than real.
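For readers who want the two rates spelled out, the sketch below shows one way true positive and true negative rates could be computed from pooled image-answer pairs. The answers are placeholders, not study data, and `detection_rates` is a hypothetical helper name.

```python
def detection_rates(pairs):
    """pairs: list of (is_ai, guessed_ai) booleans, one per image-answer pair.

    Returns (true positive rate, true negative rate).
    """
    tp = sum(1 for truth, guess in pairs if truth and guess)
    tn = sum(1 for truth, guess in pairs if not truth and not guess)
    n_ai = sum(1 for truth, _ in pairs if truth)
    n_real = len(pairs) - n_ai
    return tp / n_ai, tn / n_real

# Placeholder answers (NOT study data): (image is AI, participant said AI)
answers = [
    (True, True), (True, True), (True, True), (True, False),
    (False, False), (False, True), (False, True), (False, False),
]
tpr, tnr = detection_rates(answers)
print(f"true positive rate = {tpr:.2f}, true negative rate = {tnr:.2f}")
```

In this toy data, the gap between the two rates comes from real photographs being called AI, which is the same direction of error described above.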

Confidence bucket → % correct

Pooled across all image-answer pairs
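A pooled bucket summary of this kind might be computed along the following lines. This is only a sketch: the 9–10 bucket comes from the text above, while the other bucket boundaries, the 1 to 10 confidence range, the helper names, and the sample answers are all assumptions.

```python
from collections import defaultdict

def bucket_of(confidence):
    """Map a confidence rating to a bucket label (boundaries below 9 are assumed)."""
    if confidence >= 9:
        return "9-10"
    if confidence >= 6:
        return "6-8"
    return "1-5"

def accuracy_by_bucket(answers):
    """answers: list of (confidence, was_correct) pairs pooled over all images."""
    totals = defaultdict(lambda: [0, 0])  # bucket -> [correct, total]
    for confidence, was_correct in answers:
        counts = totals[bucket_of(confidence)]
        counts[0] += int(was_correct)
        counts[1] += 1
    return {b: 100 * c / n for b, (c, n) in totals.items()}

# Placeholder answers (NOT study data).
sample = [(10, True), (9, True), (9, False), (7, True),
          (6, False), (3, False), (4, True), (2, False)]
pct = accuracy_by_bucket(sample)
for bucket in ("1-5", "6-8", "9-10"):
    print(f"{bucket}: {pct[bucket]:.0f}% correct")
```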

Exploratory 02

Response time vs. accuracy

A linear regression of score on response time shows no relationship between time and performance (p = 0.946), suggesting that time spent on the task did not meaningfully affect accuracy.

The near-zero correlation (r = +0.06) holds across the full score range: the mean duration for every score level sits between roughly 400 and 500 seconds. Spending more time did not translate into better performance, and spending less time did not hurt. Detection appears to be more a perceptual judgment than an analytical one.
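As a minimal sketch of the two quantities discussed here, the code below computes a Pearson correlation and the slope of a simple linear regression of score on duration. The durations and scores are placeholders, not study data, and `pearson_r_and_slope` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean

def pearson_r_and_slope(x, y):
    """Return (Pearson r, OLS slope of y on x) for two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy), sxy / sxx

# Placeholder data (NOT study data): survey duration in seconds, score out of 10.
duration = [300, 350, 400, 450, 500, 550, 600, 650]
score = [7, 6, 8, 7, 7, 8, 6, 7]

r, slope = pearson_r_and_slope(duration, score)
print(f"r = {r:+.2f}, slope = {slope:+.4f} points per second")
```

Both quantities share the same numerator, so a correlation near zero and a regression slope near zero go together.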

Survey duration vs. score

Exploratory 03

Hardest images to detect

Conclusion and discussion

Laptop users outscored mobile users by about half a point (7.16 vs. 6.63, p = 0.0008), and the effect survives in the student-only probit model (coefficient = −0.29, p = 0.007). It disappears in the full-sample model once age is added as a control, which matters because device type was not randomly assigned: first-years and juniors were directed to use phones, sophomores and seniors to use laptops, and faculty and staff were asked to use either. Despite these instructions, there was still a strong element of self-selection in which device participants actually used. Within the student sample, however, mean scores are similar across all four class years, which makes age an unlikely explanation for the device effect among students.

The age effect is the most stable finding. Being over 25 is associated with a 3–5 percentage point drop in the probability of scoring in the top range, and faculty and staff averaged 6.43, the lowest of any group. Younger participants likely built informal perceptual instincts from years of encountering AI-generated imagery, and those instincts show up in task performance.

AI familiarity was a null result for students and the full sample, diverging from Frank et al. (2024), who find accuracy linked to familiarity with deepfakes across multiple countries. The exception is faculty: more AI-familiar faculty scored noticeably higher (p = 0.036).

The predictors shift by population rather than stacking. Phone use is the main drag for students; AI familiarity is what separates faculty; age pulls against everyone over 25 regardless. The most pressing open question is whether the phone penalty is actually causal, since device type was not randomly assigned. A randomized controlled trial would settle it. See the about page for more details on our proposed future directions.