About — AI Detection Study

Research motivation

We examine two main questions. 1) Does the device used to view an image affect a person's ability to accurately identify whether the image is AI-generated? 2) What other factors influence detection accuracy?

AI-generated images have become increasingly difficult to distinguish from real photographs, and people encounter them frequently in news feeds, advertising, social media, and journalism. Much of this exposure occurs on mobile devices, so it is worth assessing whether mobile users are systematically less accurate at detecting AI-generated images. Identifying the factors that predict detection performance can also inform efforts to understand and improve AI literacy.

Hypothesis

H1 Partially confirmed

Viewing on a mobile phone, compared to a laptop, decreases detection ability.

A t-test on the full sample supports this hypothesis: laptop users averaged 7.16 correct answers and mobile users 6.63 (p = 0.0008). Subgroup t-tests for juniors and first-years were not statistically significant, likely because of their smaller sample sizes; p-values fell as sample size grew across the three tests. In the ordered probit models estimated on the full sample and on the faculty sample, device type did not have statistically significant marginal effects. However, when we reran the models on the student-only sample, using a phone rather than a laptop was associated with lower scores, and the effect was highly statistically significant.

Literature review

Research consistently shows that humans are near chance at detecting AI-generated media. Nightingale and Farid (2022) find that AI-generated faces are almost indistinguishable from real ones, with detection accuracy around 48 percent, and are sometimes rated as more trustworthy. Frank et al. (2024) report similar results across multiple countries, with accuracy below 50 percent and performance linked to cognitive reflection and familiarity with deepfakes. Pehlivanoglu et al. (2026) also find near-chance performance for images, although humans outperform algorithms on video deepfakes, with better results among individuals with stronger analytical thinking.

These results may overstate real-world performance because laboratory settings differ from everyday image viewing. Josephs et al. (2023) show that distraction, lower image quality, and shorter exposure times reduce accuracy from 73.3 percent to as low as 54.8 percent. Kamali et al. (2025) similarly find that accuracy drops from about 82 percent with longer viewing time to 72 percent with brief exposure, and that human-curated AI images are harder to detect. Cooke et al. (2024) report average performance of 51.2 percent in more realistic conditions and note that smartphone use may further reduce accuracy due to lower cognitive resources compared to desktops.

This result highlights a device-related issue that has started to receive empirical attention. Sütterlin et al. (2023) find that deepfakes are less accurately identified on mobile phones and tablets than on stationary computers, with mean ranks of 63.90 compared to 82.94. They suggest this difference may be due to smaller screen sizes and reduced situational awareness on mobile devices. Although device type was not the main focus of their study, the finding aligns with concerns that device characteristics may systematically influence detection accuracy and motivates further investigation into this relationship.

Other studies point to mechanisms that could explain this pattern. Mograbi (2022) finds that smartphone users show greater present bias and are less likely to wait for information. Figl and Remus (2023) find no general accuracy differences under random assignment, but faster intuitive responses on smartphones, suggesting that differences in observational studies may reflect self-selection. Huff (2015) shows that mobile interfaces increase cognitive load through scrolling and resizing, which can reduce decision accuracy. Ward et al. (2017) further find that the presence of a smartphone reduces working memory and fluid intelligence even when not in use. Together, this literature suggests that mobile device use is associated with reduced cognitive resources and more intuitive processing, both of which are linked to lower detection accuracy.

References

Cooke, D., Edwards, A., Barkoff, S., & Kelly, K. (2024). As good as a coin toss: Human detection of AI-generated images, video, audio, and audiovisual stimuli. arXiv preprint arXiv:2403.16760.
Figl, K., & Remus, U. (2023). Thinking fast and thinking slow: Digital devices' effects on cognitive reflection. Journal of Management Information Systems, 40(2), 580–623.
Frank, J., Herbert, F., Ricker, J., Schönherr, L., Eisenhofer, T., Fischer, A., Dürmuth, M., & Holz, T. (2024). A representative study on human detection of artificially generated media across countries.
Huff, K. C. (2015). The comparison of mobile devices to computers for web-based assessments. Computers in Human Behavior, 49, 208–212.
Josephs, E. L., Fosco, C. L., & Oliva, A. (2023). Artifact magnification on deepfake videos increases human detection and subjective confidence. arXiv preprint arXiv:2304.04733.
Kamali, N., Nakamura, K., Kumar, A., Chatzimparmpas, A., Hullman, J., & Groh, M. (2025). Characterizing photorealism and artifacts in diffusion model-generated images. CHI Conference on Human Factors in Computing Systems.
Mograbi, E. (2022). Decision-makers are more impulsive on smartphones than on computers. Journal of Behavioral and Experimental Economics, 100, 101916.
Nightingale, S. J., & Farid, H. (2022). AI-synthesized faces are indistinguishable from real faces and more trustworthy. Proceedings of the National Academy of Sciences, 119(8), e2120481119.
Pehlivanoglu, D., Zhu, M., Zhen, J. et al. (2026). Is this real? Susceptibility to deepfakes in machines and humans. Cognitive Research: Principles and Implications, 11, 3.
Sütterlin, S., Ask, T. F., Mägerle, S., Glöckler, S., Wolf, L., Schray, J., ... & Lugo, R. G. (2023). Individual deep fake recognition skills are affected by viewer's political orientation, agreement with content and device used. In HCII 2023 (pp. 269–284). Springer Nature Switzerland.
Ward, A. F., Duke, K., Gneezy, A., & Bos, M. W. (2017). Brain drain: The mere presence of one's own smartphone reduces available cognitive capacity. Journal of the Association for Consumer Research, 2(2), 140–154.

Search strategy

Databases. Google Scholar, ResearchGate, arXiv.

Keyword strings.

"deepfake" ("screen size" OR "display size" OR "small screen") (detection OR discernment)
("deepfake" OR "synthetic video" OR "AI-generated video") AND ("user study" OR "human subjects" OR participant* OR crowdsource*) AND (detect* OR discern* OR "real vs fake" OR authenticity) AND (compression OR resolution OR downsampling OR "video quality")

Funnel. The first search returned approximately 30 papers based on title and keywords. Abstract review narrowed that to 11. Four became the core of the literature review and directly informed the hypotheses and the gap this study was built to fill.

Methodology

Image dataset

Many existing image datasets were not suitable for this study because they were designed to train machine learning algorithms rather than to evaluate human perception. Such datasets typically contain more than 10,000 images per category, far more than is practical in a survey where each participant views only 10 images. Most publicly available datasets were also inappropriate because their images either contained obvious AI artifacts that made them too easy to classify or consisted of paintings and artworks that would be too hard to classify. Neither type provides useful insight into typical human detection ability. We therefore assembled a dataset of 80 images (40 real, 40 AI-generated) curated from a Kaggle dataset of 12,000 images (6,000 per category), selecting for a middle range of difficulty.

Survey design

Each participant viewed 10 randomly selected images from the dataset, with attention checks included after the third and sixth images. For each image, participants indicated whether it was a real photograph or an AI-generated image and rated their confidence on a scale from 1 to 10. After completing the image detection section, participants answered questions about AI use and provided demographic information.

Participants

We collected 508 complete responses from Bowdoin College students, faculty, and staff. Participants were recruited through class-year email lists, with first-years and juniors encouraged to use mobile phones and sophomores and seniors encouraged to use laptops, as well as through a Student Digest announcement. The sample is a convenience sample and is broadly representative of the Bowdoin community, but it skews younger (ages 18–24) and more academically engaged compared to the general U.S. population.

The analytic sample (n = 419) included seniors (108), first-years (93), sophomores (89), faculty and staff (74), and juniors (54). By device, 260 participants completed the survey on a laptop and 157 on a mobile phone. With an estimated target population of about 3,200 individuals, the 508 responses correspond to a margin of error under 5 percent at the 95 percent confidence level, which is sufficient for making population-level inferences about the Bowdoin community.
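The margin-of-error figure can be checked with the standard formula for a proportion, including a finite population correction. The sketch below is illustrative only: it assumes the most conservative proportion (p = 0.5), a z-value of 1.96, and simple random sampling, which a convenience sample only approximates. The function name and parameters are ours, not part of the study's code.

```python
import math

def margin_of_error(n, population, p=0.5, z=1.96):
    """Half-width of a confidence interval for a proportion,
    with a finite population correction (assumes simple random sampling)."""
    standard_error = math.sqrt(p * (1 - p) / n)
    fpc = math.sqrt((population - n) / (population - 1))
    return z * standard_error * fpc

# 508 responses from an estimated population of about 3,200.
moe = margin_of_error(n=508, population=3200)
print(f"margin of error: {moe:.1%}")
```

Under these assumptions the result is roughly 4 percent, consistent with the "under 5 percent" statement above.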

Data

Each participant's score (ranging from 2 to 10 in the observed data) represents the number of images they correctly classified out of 10. The dataset is a modified, anonymized version of the 508-response dataset available on the project website. The analytic sample used in the statistical models includes 419 observations after excluding 89 responses under three criteria: failed attention checks, devices other than a laptop or phone, and response times more than 3 standard deviations from the mean. The score distribution does not meaningfully change after these exclusions.
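The three exclusion criteria described above could be applied along the following lines. This is a minimal sketch, not the project's pipeline: the field names (`passed_checks`, `device`, `seconds`) and the example rows are hypothetical, and we assume the response-time mean and standard deviation are computed after the first two rules are applied, which the write-up does not specify.

```python
from statistics import mean, stdev

def apply_exclusions(rows, max_z=3.0):
    """Return the rows that survive the three exclusion rules."""
    # Rules 1 and 2: keep attentive respondents on a laptop or phone.
    kept = [
        r for r in rows
        if r["passed_checks"] and r["device"] in {"laptop", "phone"}
    ]
    # Rule 3: drop response times more than max_z standard deviations
    # from the mean of the remaining responses (an assumed ordering).
    times = [r["seconds"] for r in kept]
    mu, sd = mean(times), stdev(times)
    return [r for r in kept if abs(r["seconds"] - mu) <= max_z * sd]

# Hypothetical responses: 20 ordinary ones plus one of each exclusion type.
rows = [
    {"passed_checks": True, "device": "laptop" if i % 2 else "phone", "seconds": 90 + i}
    for i in range(20)
]
rows += [
    {"passed_checks": False, "device": "laptop", "seconds": 95},  # failed attention check
    {"passed_checks": True, "device": "tablet", "seconds": 100},  # other device
    {"passed_checks": True, "device": "phone", "seconds": 5000},  # response-time outlier
]

analytic = apply_exclusions(rows)
print(len(rows), "->", len(analytic))
```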

A more detailed description of methodology can be found under Survey Design.

Analysis

We began with descriptive statistics and unpaired t-tests to compare scores across device groups. We analyzed the full sample and also examined juniors and first-years separately, as these groups had the most balanced distribution of device types.
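For readers unfamiliar with the unpaired t-test, the sketch below computes the test statistic for two independent groups. It uses Welch's unequal-variance form, which is an assumption on our part; the write-up does not say which variant was used. The score lists are made-up illustrations, not study data.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's unequal-variance t statistic and its approximate degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical scores out of 10 for two small device groups.
laptop = [8, 7, 7, 9, 6, 8, 7, 6]
phone = [6, 7, 5, 8, 6, 7, 6, 5]

t, df = welch_t(laptop, phone)
print(f"t = {t:.2f}, df = {df:.1f}")
```

A positive t here means the first group scored higher on average. If SciPy is available, `scipy.stats.ttest_ind(laptop, phone, equal_var=False)` returns the same statistic together with a p-value.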

For the multivariate analysis, we used an ordered probit model. The outcome variable, the number of correct answers out of 10, is discrete, ordered, and bounded. This makes ordered probit more appropriate than ordinary least squares (OLS), as it accounts for the ordinal structure without assuming equal spacing between score values.
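For reference, the standard textbook form of the ordered probit model is sketched below. The notation is ours and is not taken from the project's appendix: y_i^* is the latent detection ability of participant i, x_i the covariates, \kappa_j the cut points, and \Phi the standard normal CDF, with \kappa_0 = -\infty and \kappa_J = +\infty.

```latex
\begin{aligned}
y_i^{*} &= x_i^{\top}\beta + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, 1) \\
y_i &= j \quad \text{if} \quad \kappa_{j-1} < y_i^{*} \le \kappa_j \\
\Pr(y_i = j \mid x_i) &= \Phi\!\left(\kappa_j - x_i^{\top}\beta\right) - \Phi\!\left(\kappa_{j-1} - x_i^{\top}\beta\right)
\end{aligned}
```

Because the cut points are estimated rather than fixed, the model does not require the distance between a score of 6 and 7 to equal the distance between 9 and 10.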

We estimated three sets of ordered probit models. The first set used the full analytic sample and included nine model specifications of increasing complexity. These ranged from a baseline model with gender and race controls to models that added age, device type, AI familiarity, social media use, and response time. Model selection was based on the Akaike Information Criterion (AIC), which balances model fit and complexity. The best-performing model included gender, race, age over 25, and device type, and had the lowest AIC of 1507.66. We then estimated seven model specifications using only the faculty sample (n = 66). In this case, the best model included gender and race, along with age, device type, and AI familiarity. Finally, we estimated the same seven models using only the student sample (n = 335). The best-performing model included the baseline controls and device type.
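AIC-based selection can be illustrated as follows. AIC is defined as 2k - 2 ln(L), where k is the number of estimated parameters and L the maximized likelihood; lower is better. The model names, log-likelihoods, and parameter counts below are invented for illustration and are not the study's fitted values.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical candidates: (maximized log-likelihood, number of parameters).
candidates = {
    "baseline": (-760.0, 10),
    "baseline + device": (-752.0, 11),
    "baseline + device + response time": (-751.8, 12),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)

for name, value in scores.items():
    print(f"{name}: AIC = {value:.1f}")
print("selected:", best)
```

In this made-up example the third model fits slightly better than the second, but not by enough to offset the penalty for its extra parameter, so the second model is selected.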

Results

A linear regression of score on response time shows no relationship between time and performance (p = 0.946), suggesting that time spent on the task did not meaningfully affect accuracy. While not central to our analysis, it may still be interesting to explore whether there is a nonlinear effect or threshold in response time where performance changes.

In the full sample, a t-test shows a significant difference in performance by device. Laptop users scored higher on average (7.16) than mobile users (6.63), and the difference is statistically significant (p = 0.0008). Subgroup analyses for juniors and first-years were not statistically significant, likely because of their smaller sample sizes: across the three t-tests, p-values fell noticeably as sample size increased.

Modeling

I. Full sample analysis
The best-fitting ordered probit model for the full sample includes gender, race, age, and device type. Model diagnostics indicate evenly spaced cut points and no issues with multicollinearity. Gender and race were included as controls and their marginal effects are not interpreted. The model shows that being over 25 is statistically significant (alpha = 0.05) and associated with lower performance. Specifically, individuals over 25 are 3 to 5 percentage points less likely to score 8, 9, or 10, and 1 to 2 percentage points more likely to score between 4 and 6. Device type is not statistically significant in this specification.

II. Faculty only
In the faculty-only sample, the best model includes controls (race and gender), age, device type, and AI familiarity. Model diagnostics indicate no multicollinearity. Cut-point analysis shows a notable jump between cut points 7 and 8, implying that for faculty, moving from a score of 9 to 10 required the largest increase in the underlying latent variable. The ordered probit coefficient on AI familiarity is positive and statistically significant (coefficient = 0.439, p = 0.036), indicating higher performance among more AI-familiar participants. Marginal effects show that AI familiarity is associated with lower probabilities of scores in the 2 to 6 range and higher probabilities of scores in the 7 to 10 range, with moderate significance (alpha = 0.1). Device type is not statistically significant in this sample.

III. Student only
In the student-only sample, the best model includes baseline controls and device type. Device type is statistically significant, with a negative ordered probit coefficient for mobile phone use (coefficient = -0.29, p = 0.007). The cut points are fairly evenly spaced. Many marginal effects for device are statistically significant at the 0.05 level. Notably, the marginal effects are negative for higher scores (7–10) and positive for lower scores (2–6), indicating that using a phone increases the probability of lower scores and decreases the probability of higher scores.
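To see how a negative coefficient shifts probability toward lower scores, the sketch below computes ordered probit category probabilities for a laptop user and a phone user. Only the coefficient of -0.29 comes from the results above. The cut points are hypothetical placeholders, and the laptop user's latent index is set to 0, which assumes all other covariates are at their reference levels.

```python
from statistics import NormalDist

SCORES = list(range(2, 11))  # observed score range, 2 through 10

def category_probs(xb, cutpoints):
    """P(score = j) for each category, given latent index xb and ordered cut points."""
    cdf = NormalDist().cdf
    # Cumulative probabilities at each cut point, padded with 0 and 1.
    bounds = [0.0] + [cdf(k - xb) for k in cutpoints] + [1.0]
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]

# Hypothetical cut points (8 cut points separate the 9 score categories).
cutpoints = [-2.4, -1.9, -1.4, -0.8, -0.2, 0.5, 1.2, 2.0]
beta_phone = -0.29  # coefficient reported for the student-only model

laptop = category_probs(0.0, cutpoints)
phone = category_probs(beta_phone, cutpoints)

for score, p_laptop, p_phone in zip(SCORES, laptop, phone):
    print(f"score {score:2d}: laptop {p_laptop:.3f}  phone {p_phone:.3f}  "
          f"change {p_phone - p_laptop:+.3f}")
```

Whatever the cut points, a negative coefficient raises the probability of the lowest score and lowers the probability of the highest; where the sign flips in between depends on the estimated cut points.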

IV. Summary
The key predictors shift noticeably across sub-samples. In the full sample, age (over 25 or not) stands out as the main driver of performance. However, once the data are split, a different pattern emerges. Among students, device type becomes a strong predictor, with mobile use linked to lower scores. In contrast, among faculty, AI familiarity is the dominant factor and is associated with higher performance.

Limitations

Device-age confound

A key limitation of this study is the confounding between device type, class year, and age. Device type was not randomly assigned. Recruitment used class-year email lists, with first-years and juniors instructed to complete the survey on mobile devices and sophomores and seniors instructed to use laptops, while faculty and staff received mixed instructions. As a result, device type is structurally correlated with class year and age by design rather than by chance. This makes it difficult to separate the effect of device from the effects of age and class year. This is reflected in the ordered probit results, where device type loses statistical significance once age is included as a control.

Self-selection bias

There is also self-selection bias. Although participants were given instructions suggesting a preferred device by group, they could choose which device to use to complete the survey. This introduces variation in device choice that is not fully controlled by the study design.

Survey framing effects

Another limitation is survey framing effects. Participants were explicitly told that the study involved distinguishing real from AI-generated images. This likely increased attention and caution compared to real-world settings, where people encounter images without being prompted to evaluate their authenticity.

Image set constraints

Finally, the image set itself imposes constraints. The images were not fully up to date, varied in difficulty, and each participant evaluated only 10 images. This limited exposure reduces how well the task reflects real-world image detection ability.

Future directions

This study shows that there is a device-related difference in performance in the full sample and that age is associated with performance. However, it does not establish causality. Device type was partly influenced by the class-year recruitment design rather than being randomly assigned, which means device type, class year, and age are correlated by construction.

A randomized controlled trial is needed to identify the causal effect of device type. In such a design, participants would be randomly assigned to a device condition rather than choosing their own device. Using standardized hardware would also help control for differences in screen size and display quality, allowing for a cleaner estimate of the effect of device type on performance.

Acknowledgments

This project was completed as part of DCS3850: Advanced Data Science at Bowdoin College, Spring 2026. We thank all 508 participants who took the time to complete the survey. And a big thanks to Prof. V for her help and guidance throughout the process!

Team

Lulu Linkas

Statistical analysis, survey design, appendix

Maddy Ohta

T-tests, exploratory findings, write-up

Seamus Woodruff

Website, data pipeline, visualisations, image curation

Appendix

GitHub

Full appendix and analysis code

github.com/llinkas11/ai-image-detection
