AI-Assisted Investment Analytics: Why the First Answer Was Not Good Enough
A case study on AI boom stocks, 10-K disclosures, factor models, placebo tests, and the role of human judgment in investment analytics.
There is a tempting story in financial markets: after the launch of ChatGPT, companies connected to AI infrastructure massively outperformed the rest of the market. It sounds plausible. NVIDIA, GPUs, semiconductors, data centers, cloud infrastructure and AI capex all became part of the market narrative after late 2022.
So I used it as an investment analytics case study. The first question was simple: did AI infrastructure stocks achieve higher returns after the start of the generative AI boom? A more interesting question appeared once the first results came in: how much of that result survives when you stop trusting the first answer?
This post is not about stock picking. It is about the analytical process: how AI can help build an investment analysis quickly, and why human judgment is still needed before anyone should rely on the result.
The setup
I analysed 41 US-listed companies, grouped into four baskets:
| Group | Examples | Rationale |
|---|---|---|
| AI infrastructure | NVDA, AMD, AVGO, TSM, ASML, AMAT, LRCX, KLAC, MU, MRVL, SMCI | Semiconductors, GPU supply chain, hardware and infrastructure exposure |
| AI platforms / cloud / software | MSFT, GOOGL, AMZN, META, ORCL, CRM, NOW, ADBE, IBM, SNOW | Companies monetising AI through platforms, cloud and software |
| Broad technology control | AAPL, CSCO, INTC, TXN, QCOM, INTU, ACN, PANW, CDNS, ADSK | Technology companies not treated as pure AI infrastructure exposure |
| Defensive control | KO, PEP, PG, WMT, JNJ, MRK, MCD, COST, CL, GIS | Lower-beta defensive comparison group |
Table: group definitions used throughout the post.
The event date was 30 November 2022, the public launch of ChatGPT. The analysis used daily adjusted prices from Yahoo Finance, SPY as the market benchmark, Treasury-bill data as a risk-free-rate proxy, and Fama-French factor data for the later validation step.
The project started with a standard investment analytics workflow:
- compare average returns relative to SPY,
- compare risk-return metrics across groups,
- test before-after performance around the ChatGPT launch,
- add CAPM adjustment,
- add placebo tests.
Then I extended it further:
- build a pre-event AI Exposure Score from SEC 10-K / 20-F filings,
- run cross-sectional regressions,
- test robustness across scoring methods and windows,
- validate the result using CAPM, Fama-French 3-factor and Carhart 4-factor models.
At first glance, the result looked strong. Then it got complicated.
The tempting first result
After the event date, the AI infrastructure portfolio beat SPY by a wide margin while the defensive portfolio lagged it.
| Portfolio | Mean monthly spread vs SPY |
|---|---|
| AI infrastructure | +2.64% |
| Defensive control | -1.09% |
Table: mean monthly spread vs SPY (post-event).
The difference was:
+3.73 percentage points per month
Welch p-value = 0.0079
95% CI = [+1.01%; +6.44%]
Hedges' g = 0.59
That looks statistically significant and economically meaningful. A careless version of the analysis could stop here and declare that AI infrastructure generated an AI boom premium. It would be too quick.
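As a rough illustration of where numbers like these come from, the sketch below computes a Welch t-statistic, the Welch-Satterthwaite degrees of freedom and Hedges' g for two samples of monthly spreads. The function name and the input values are hypothetical, not the data behind the table above. The p-value is left out because it needs a t-distribution CDF; a library call such as `scipy.stats.ttest_ind(a, b, equal_var=False)` would supply it.

```python
import math
from statistics import mean, variance

def welch_and_hedges(a, b):
    """Welch t-statistic, Welch-Satterthwaite df and Hedges' g for two samples."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    va, vb = variance(a), variance(b)          # sample variances (n - 1 denominator)

    se2 = va / na + vb / nb                    # squared standard error of the difference
    t = (ma - mb) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))

    # Cohen's d on the pooled standard deviation, then the small-sample correction.
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    d = (ma - mb) / pooled_sd
    g = d * (1 - 3 / (4 * (na + nb) - 9))
    return {"diff": ma - mb, "t": t, "df": df, "g": g}

# Hypothetical monthly spreads vs SPY, in percent (illustration only).
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 3.0, 4.0, 5.0, 6.0]
print(welch_and_hedges(group_a, group_b))
```

The Welch variant is used here because the two baskets clearly do not share a variance, which is the situation the test is designed for.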
The problem with the first answer
The first result used a simple spread:
portfolio return - SPY return
This is intuitive, but it hides a strong assumption: it treats the portfolio as if its market beta were equal to 1. AI infrastructure stocks and defensive stocks do not have remotely similar risk profiles. In the data, the estimated post-event beta of the AI infrastructure portfolio was close to 1.98, against roughly 0.24 for the defensive portfolio. Part of the outperformance could simply be compensation for higher systematic risk.
In other words, the first result may have bundled together:
AI exposure
+ high beta
+ technology regime
+ market timing
+ ex-post stock selection
rather than capturing a clean AI-specific premium.
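The hidden beta-equals-one assumption can be made concrete with a small sketch. It compares the simple spread with a CAPM-style alpha on made-up excess returns in which the portfolio is, by construction, nothing more than a 2x levered market. All function names and numbers are illustrative assumptions.

```python
from statistics import mean

def market_beta(port_excess, mkt_excess):
    """OLS slope of portfolio excess returns on market excess returns."""
    mp, mm = mean(port_excess), mean(mkt_excess)
    cov = sum((p - mp) * (m - mm) for p, m in zip(port_excess, mkt_excess))
    var = sum((m - mm) ** 2 for m in mkt_excess)
    return cov / var

def simple_spread(port_excess, mkt_excess):
    """Mean spread vs the market; implicitly assumes beta = 1."""
    return mean(port_excess) - mean(mkt_excess)

def capm_alpha(port_excess, mkt_excess):
    """Mean return left over after scaling the market by the estimated beta."""
    beta = market_beta(port_excess, mkt_excess)
    return mean(port_excess) - beta * mean(mkt_excess)

# Made-up monthly excess returns: the portfolio is exactly twice the market.
mkt = [0.02, -0.01, 0.03, 0.00, 0.01]
port = [2 * m for m in mkt]

print("beta:  ", round(market_beta(port, mkt), 2))
print("spread:", round(simple_spread(port, mkt), 4))
print("alpha: ", round(capm_alpha(port, mkt), 4))
```

In this toy case the spread is positive (one percentage point per period) even though the alpha is zero: the whole "outperformance" is leverage on the market.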
Adding financial discipline
The next step was to compare risk-return metrics across the four groups: annualized return and volatility, Sharpe and Sortino ratios, maximum drawdown, CAPM beta and CAPM alpha.
| Group | Annualized return | Volatility | Sharpe | Max drawdown | Beta | CAPM alpha |
|---|---|---|---|---|---|---|
| AI infrastructure | 36.2% | 50.1% | 0.67 | -58.4% | 1.85 | +11.9% |
| AI platforms | 7.2% | 37.4% | 0.15 | -56.7% | 1.30 | -10.7% |
| Broad tech | 11.1% | 34.9% | 0.23 | -45.2% | 1.26 | -6.4% |
| Defensive | 8.2% | 19.2% | 0.25 | -29.7% | 0.34 | +1.1% |
The AI infrastructure group still looked strong, with the highest return and the highest Sharpe Ratio. But now the risk was visible too: the highest volatility, very high market beta, and the deepest drawdown of all four groups.
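For readers who want to reproduce metrics of this kind, here is a minimal sketch of the calculations from a list of daily returns. The conventions are assumptions on my part (252 trading days, geometric annualization of return, a constant daily risk-free rate); the exact conventions behind the table may differ.

```python
import math
from statistics import mean, stdev

TRADING_DAYS = 252  # assumed annualization factor for daily data

def risk_metrics(daily_returns, daily_rf=0.0):
    """Annualized return, volatility, Sharpe, Sortino and maximum drawdown."""
    n = len(daily_returns)
    excess = [r - daily_rf for r in daily_returns]

    growth = math.prod(1 + r for r in daily_returns)
    ann_return = growth ** (TRADING_DAYS / n) - 1
    ann_vol = stdev(daily_returns) * math.sqrt(TRADING_DAYS)
    sharpe = mean(excess) / stdev(excess) * math.sqrt(TRADING_DAYS)

    # Downside deviation: root mean square of the negative excess returns only.
    downside = math.sqrt(mean([min(e, 0.0) ** 2 for e in excess]))
    sortino = (mean(excess) / downside * math.sqrt(TRADING_DAYS)
               if downside > 0 else float("inf"))

    # Maximum drawdown: worst peak-to-trough loss of the cumulative wealth path.
    wealth, peak, max_dd = 1.0, 1.0, 0.0
    for r in daily_returns:
        wealth *= 1 + r
        peak = max(peak, wealth)
        max_dd = min(max_dd, wealth / peak - 1)

    return {"ann_return": ann_return, "ann_vol": ann_vol, "sharpe": sharpe,
            "sortino": sortino, "max_drawdown": max_dd}

# Three exaggerated "daily" returns, purely to make the drawdown easy to follow.
print(risk_metrics([0.10, -0.50, 0.20]))
```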
For Sharpe Ratio, the ANOVA result was:
AI infrastructure Sharpe: 0.67
Other groups: around 0.15–0.25
ANOVA p-value = 0.005
For CAPM alpha:
AI infrastructure alpha: +11.9% annually
ANOVA p-value = 0.0004
Post-hoc tests showed that AI infrastructure had a significantly higher Sharpe Ratio than each of the other groups. For CAPM alpha, it was significantly higher than AI platforms and broad technology, but not than defensive stocks.
This was already an improvement on the simple spread. Instead of “AI infrastructure went up”, the data now said: AI infrastructure had a stronger historical risk-return profile, and the strength came with materially higher risk. A statement I could actually defend.
One problem remained, and it was a big one. The stock baskets were defined with today’s knowledge, which creates look-ahead bias.
The look-ahead problem
If I classify companies as “AI infrastructure winners” after seeing the AI boom, the analysis is contaminated: the portfolio may be selecting winners after the fact. That is not a small technicality. It changes the interpretation.
A cleaner question is whether companies that disclosed AI exposure before the ChatGPT launch later earned higher abnormal returns. That question led to the most important extension of the project: a pre-event AI Exposure Score.
Building a pre-event AI Exposure Score
Instead of relying only on hand-labelled groups, I built a company-level AI exposure measure from annual SEC filings available before 30 November 2022. For each company, I used the most recent 10-K filed before the event date; for foreign filers such as TSM and ASML, the 20-F. The idea was to measure what companies had already disclosed about AI before the market narrative exploded.
The main score was deterministic and reproducible. It used a conservative pre-2022 AI dictionary with terms such as:
- artificial intelligence
- machine learning
- deep learning
- neural network
- natural language processing
- computer vision
- predictive analytics
- autonomous systems
- recommendation system
- data science
- automation
- algorithmic decision-making
The score was normalized by document length, transformed, winsorized and converted into a percentile rank.
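The pipeline in the previous sentence can be sketched roughly as follows. This is a simplified illustration, not the scoring code used in the project: the dictionary is a four-term subset, the transform is assumed to be `log1p`, the winsorization limits (5% and 95%, nearest rank) are assumed, and the filing texts are invented.

```python
import math
import re

# Subset of the dictionary listed above; the full term list would go here.
AI_TERMS = ["artificial intelligence", "machine learning",
            "deep learning", "neural network"]

def mentions_per_10k(text):
    """Dictionary hits per 10,000 words of the filing."""
    lowered = text.lower()
    n_words = len(re.findall(r"\w+", lowered))
    # Leading \b only, so plural forms such as "neural networks" still count.
    hits = sum(len(re.findall(r"\b" + re.escape(term), lowered)) for term in AI_TERMS)
    return 10_000 * hits / n_words if n_words else 0.0

def winsorize(values, lower_q=0.05, upper_q=0.95):
    """Clip values to nearest-rank quantile bounds (assumed limits)."""
    ordered = sorted(values)
    lo = ordered[int(lower_q * (len(ordered) - 1))]
    hi = ordered[math.ceil(upper_q * (len(ordered) - 1))]
    return [min(max(v, lo), hi) for v in values]

def percentile_rank(values):
    """Share of observations below each value, counting ties as half."""
    n = len(values)
    return [(sum(v < x for v in values) + 0.5 * sum(v == x for v in values)) / n
            for x in values]

def exposure_scores(filings):
    """Map {ticker: filing text} to {ticker: percentile-rank score}."""
    tickers = list(filings)
    raw = [mentions_per_10k(filings[t]) for t in tickers]
    transformed = [math.log1p(v) for v in raw]   # assumed transform
    return dict(zip(tickers, percentile_rank(winsorize(transformed))))

# Invented one-sentence "filings" for two hypothetical companies.
filings = {
    "AAA": "Machine learning and neural networks power our artificial intelligence products",
    "BBB": "We sell soft drinks and snacks worldwide",
}
print(exposure_scores(filings))
```

Because every step is a pure function of the filing text, the score is reproducible, which is the property that makes it a useful baseline against the LLM-based variant.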
I also tested alternative versions:
- dictionary with GPU counted in an AI-related context,
- dictionary without “automation”,
- extended dictionary,
- raw mentions per 10,000 words,
- log-scaled score,
- z-score,
- LLM-based score.
This mattered because text-based exposure is not objective truth. It is a measurement model, and measurement models can fail.
What the disclosure data showed
The disclosure-based score produced an early warning: the companies that talked most about AI before the event were not the companies that later delivered the strongest market returns.
| Group | Average AI Exposure Score |
|---|---|
| AI platforms | 0.73 |
| Broad technology | 0.64 |
| AI infrastructure | 0.50 |
| Defensive | 0.18 |
Table: average deterministic AI exposure score (group means).
The market did not simply reward whoever had written the most about AI in their filings. AI platforms and software companies often had more explicit AI-related disclosure, yet the strongest performance came from infrastructure names: semiconductors, the GPU supply chain and data-center hardware. That weakens the simple story that more AI disclosure means a higher AI premium. What the market actually rewarded was the part of the AI value chain that became directly tied to compute demand and capex.
Cross-sectional regression: the key test
The central test was a cross-sectional regression:
6-month abnormal return
= β0 + β1 × AI Exposure Score
+ β2 × pre-event beta
+ β3 × pre-event volatility
+ β4 × pre-event momentum
+ β5 × log market cap
+ ε
The dependent variable was the 6-month post-event abnormal return, from 30 November 2022 to 31 May 2023.
| Model | Controls | AI Exposure coefficient | Result |
|---|---|---|---|
| M1 | none | +0.259 | significant |
| M2 | beta + volatility | -0.009 | not significant |
| M3 | beta + volatility + momentum | +0.013 | not significant |
| M4 | beta + volatility + momentum + size | +0.044 | not significant |
In the simple model, AI Exposure looked predictive:
β1 = +0.259
p = 0.044
After adding beta and volatility, the effect disappeared:
M2–M4: p > 0.68
This was the single most important result of the project. The first version of the answer said that AI exposure explains the premium. The better version said that AI exposure is correlated with risk characteristics that explain much of the premium. The correlation matrix made the mechanism clear:
| Relationship | Correlation |
|---|---|
| AI Exposure vs pre-event beta | +0.50 |
| AI Exposure vs pre-event volatility | +0.48 |
| Pre-event beta vs 6-month abnormal return | +0.61 |
| Pre-event volatility vs 6-month abnormal return | +0.66 |
This does not mean AI exposure was irrelevant. It means the evidence for an independent AI-specific premium became much weaker once risk was controlled.
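The mechanism is easy to reproduce on synthetic data. In the sketch below, returns are constructed to depend only on beta, while the exposure score is merely correlated with beta. The numbers are invented, the helper `ols` is hypothetical, and standard errors are omitted; the point is only how the exposure coefficient behaves once the control enters.

```python
import numpy as np

def ols(y, *regressors):
    """OLS coefficients [intercept, b1, b2, ...] via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Synthetic cross-section of eight firms (illustration only).
beta = np.array([0.3, 0.5, 0.8, 1.0, 1.2, 1.5, 1.8, 2.0])
exposure = np.array([0.1, 0.4, 0.3, 0.6, 0.5, 0.9, 0.7, 1.0])  # correlated with beta
abnormal_return = 0.3 * beta          # by construction, exposure plays no role

m1 = ols(abnormal_return, exposure)          # no controls
m2 = ols(abnormal_return, exposure, beta)    # controlling for beta

print("M1 exposure coefficient:", round(m1[1], 3))
print("M2 exposure coefficient:", round(m2[1], 3))
print("M2 beta coefficient:    ", round(m2[2], 3))
```

Without the control, the exposure coefficient is clearly positive; with beta included it collapses to zero and beta picks up its true coefficient of 0.3. That is the same pattern as the move from M1 to M2, although real data are of course noisier.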
The uncomfortable robustness checks
I then tried to break the result, which is usually where an analysis starts paying for itself. The robustness checks included:
- leave-one-out regressions,
- excluding NVIDIA,
- excluding top-3 AI exposure firms,
- excluding top-3 post-event return firms,
- alternative AI dictionaries,
- dictionary without “automation”,
- LLM-based AI score,
- LLM missing-as-zero fallback,
- 3-month, 6-month and 12-month event windows,
- alternative scaling methods.
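Of the checks listed above, leave-one-out is the simplest to sketch: re-estimate the regression once per company, each time dropping that company, and see how far the coefficient of interest moves. The tickers and values below are hypothetical, and the function assumes the variable of interest sits in the first column of the regressor matrix.

```python
import numpy as np

def first_slope(y, X):
    """Coefficient on the first regressor in an OLS fit with intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

def leave_one_out(tickers, y, X):
    """Re-estimate the first slope with each observation dropped in turn."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    results = {}
    for i, ticker in enumerate(tickers):
        keep = np.arange(len(y)) != i
        results[ticker] = first_slope(y[keep], X[keep])
    return results

# Hypothetical data in which a single firm ("EEE") drives the whole relationship.
tickers = ["AAA", "BBB", "CCC", "DDD", "EEE"]
exposure = [0.0, 1.0, 2.0, 3.0, 10.0]
returns = [0.0, 0.0, 0.0, 0.0, 5.0]

print("full sample:", round(first_slope(np.array(returns), np.array(exposure)), 3))
for ticker, slope in leave_one_out(tickers, returns, exposure).items():
    print(f"without {ticker}: {slope:.3f}")
```

In this toy sample the slope drops to zero as soon as "EEE" is removed, which is exactly the kind of single-name dependence the NVIDIA exclusion is meant to detect.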
Most deterministic-score variants did not produce a significant coefficient after risk controls. Two exceptions appeared.
Exception 1: the LLM-based score
I used Claude Haiku as a validation layer. The model received excerpts from filings containing AI-related terms and classified each company’s AI exposure from 0 to 3:
0 = no AI exposure
1 = marginal mention
2 = important business component
3 = core business / strategic exposure
The LLM score produced a stronger result:
LLM score, n = 34:
β1 = +0.379
p = 0.002
But this sample excluded 7 companies where the deterministic pipeline found no AI-related excerpts, so I also tested a conservative fallback that replaced missing LLM scores with zero:
LLM missing-as-zero, n = 41:
β1 = +0.176
p = 0.041
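The fallback itself is a one-liner; a minimal sketch with hypothetical tickers and scores:

```python
def fill_missing_as_zero(llm_scores, all_tickers):
    """Return a score for every ticker, using 0 where the LLM produced none."""
    return {ticker: llm_scores.get(ticker, 0) for ticker in all_tickers}

# Hypothetical tickers and scores, for illustration only.
all_tickers = ["AAA", "BBB", "CCC", "DDD"]
llm_scores = {"AAA": 3, "BBB": 1, "CCC": 2}   # "DDD" had no AI-related excerpts

filled = fill_missing_as_zero(llm_scores, all_tickers)
print(len(llm_scores), "scored;", len(filled), "after fallback")
```

The design choice deserves a caveat: a zero produced this way means "no excerpt was found", which is not quite the same as the LLM reading the filing and judging the exposure to be 0.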
This has to be read carefully. The LLM may capture semantic exposure that a dictionary misses; some companies describe AI-relevant products without ever writing “machine learning” or “artificial intelligence”. At the same time, it introduces model risk:
- prompt dependency,
- excerpt selection,
- model version dependency,
- self-selection of companies with AI-related text,
- weaker reproducibility than deterministic scoring.
So I do not treat the LLM score as the main result. I treat it as a diagnostic: different measurement models of “AI exposure” can change the conclusion, and that is worth knowing in itself.
Exception 2: the 12-month window
The main 6-month window did not support an independent deterministic AI Exposure effect after controls. The 12-month window did:
12-month window:
β1 = +0.254
p = 0.031
One plausible reading is that the market priced AI exposure gradually rather than overnight. The AI capex narrative did not fully form on 30 November 2022; it developed over subsequent quarters, through earnings calls, GPU demand, cloud investment announcements and data-center buildout. But the result is window-sensitive, so I treat it as evidence worth following rather than final proof.
Factor models: does the alpha survive?
The final validation step used standard asset-pricing models: CAPM, the Fama-French 3-factor model and the Carhart 4-factor model with momentum. For the equal-weighted AI infrastructure portfolio, the estimated annualized alpha remained positive:
| Model | AI infrastructure alpha | HAC p-value |
|---|---|---|
| CAPM | +8.7% | 0.41 |
| Fama-French 3 | +14.7% | 0.15 |
| Carhart 4 | +13.2% | 0.19 |
The alpha estimate was positive in all three models, and statistically significant in none of them. So I cannot honestly conclude that AI infrastructure generated a confirmed standalone alpha. The honest version: AI infrastructure had a positive estimated alpha, but the evidence was too noisy to separate a clean AI-specific premium from risk exposures and market regime effects. Less catchy, harder to attack.
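A compact sketch of this kind of estimate: regress the portfolio's excess returns on the factor returns, read the intercept as alpha, and compute a HAC standard error for it. The version below uses a Newey-West estimator with a Bartlett kernel, five lags and simple multiplication by 252 for annualization; all of these are assumptions for illustration, as are the simulated inputs. The p-value is omitted because it needs a distribution function on top of the t-statistic.

```python
import numpy as np

def alpha_with_hac(excess_ret, factors, lags=5, periods_per_year=252):
    """OLS alpha and factor loadings with a Newey-West t-statistic for alpha."""
    y = np.asarray(excess_ret, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(factors, dtype=float)])

    xtx_inv = np.linalg.inv(X.T @ X)
    coef = xtx_inv @ X.T @ y
    resid = y - X @ coef

    # Newey-West "meat": lag-0 outer products plus Bartlett-weighted autocovariances.
    xe = X * resid[:, None]
    S = xe.T @ xe
    for lag in range(1, lags + 1):
        weight = 1 - lag / (lags + 1)
        gamma = xe[lag:].T @ xe[:-lag]
        S += weight * (gamma + gamma.T)

    cov = xtx_inv @ S @ xtx_inv
    se = np.sqrt(np.diag(cov))
    return {"alpha_annual": coef[0] * periods_per_year,
            "alpha_t": coef[0] / se[0],
            "loadings": coef[1:]}

# Simulated daily data: one market factor, true loading 1.5, small true alpha.
rng = np.random.default_rng(0)
mkt = rng.normal(0.0, 0.01, 500)
portfolio = 0.0002 + 1.5 * mkt + rng.normal(0.0, 0.005, 500)

print(alpha_with_hac(portfolio, mkt))
```

For the 3-factor or 4-factor versions, `factors` would simply be a matrix with one column per factor. Even in simulated data with a genuinely positive alpha, the t-statistic on the intercept is often unimpressive over a sample of this length, which is the same difficulty as in the table above.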
The placebo problem did not disappear
The earlier before-after A/B test carried its own warning. For AI-exposed companies, I compared the average daily spread vs SPY before and after the event, and the main event result looked strong. Then I repeated the same procedure with a fake event date: 30 November 2021, one year before the ChatGPT launch.
A well-specified event study should find nothing around a fake date. Mine found a statistically significant effect, in the opposite direction.
That matters. The simple before-after method was picking up broad market regimes and technology-stock dynamics, not just the ChatGPT event. Results like this make an analysis less spectacular and more credible at the same time.
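The logic of the placebo check can be sketched in a few lines. The example applies the same before-after comparison to the real event date and to the fake one, on a synthetic spread series that contains nothing but a slow drift. The 180-calendar-day window and the drifting series are assumptions for illustration, and no significance test is included.

```python
from datetime import date, timedelta
from statistics import mean

def before_after_diff(dates, spreads, event, window_days=180):
    """Mean spread in the window after the event minus the window before it."""
    before = [s for d, s in zip(dates, spreads) if 0 < (event - d).days <= window_days]
    after = [s for d, s in zip(dates, spreads) if 0 <= (d - event).days < window_days]
    return mean(after) - mean(before)

# Synthetic daily spread vs SPY: a steady drift with no event in it at all.
start = date(2021, 1, 1)
dates = [start + timedelta(days=i) for i in range(1095)]
spreads = [0.00001 * i for i in range(1095)]

real = before_after_diff(dates, spreads, date(2022, 11, 30))
placebo = before_after_diff(dates, spreads, date(2021, 11, 30))
print("real event date:   ", round(real, 6))
print("placebo event date:", round(placebo, 6))
```

Both dates show the same "effect" here, because the method is measuring the drift rather than anything that happened on a particular day. A before-after comparison that reacts to a fake date in real data is telling you the same thing.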
What AI did well
AI was genuinely useful throughout the project. It helped structure the analysis, translate a broad question into testable hypotheses, generate Python code, prepare LaTeX sections, suggest statistical procedures, point out missing robustness checks, and reframe conclusions after methodological criticism. The work moved much faster than it would have otherwise.
But speed is not correctness. AI produces a polished first answer very quickly, and it can make a weak analysis look more convincing than it is. That combination is dangerous.
What human judgment changed
Every important improvement came from questioning the first answer. The initial analysis was too confident: it treated the spread vs SPY as if it were already a clean abnormal return, which it was not. Reviewing it changed the project in five ways.
First, the question shifted from raw outperformance to risk-adjusted performance. Not “did AI infrastructure stocks go up more?” but “did they go up more than expected, given their risk profile?”
Second, the measurement moved from ex-post labels to pre-event data. Hand-labelled “AI winners” gave way to a disclosure-based score built from filings available before the event, which reduced (though did not eliminate) look-ahead bias.
Third, the analysis confronted model risk by comparing deterministic dictionary scores, LLM scores and alternative scaling methods. The conclusion turned out to depend partly on how “AI exposure” is measured.
Fourth, a single p-value stopped counting as a conclusion. Leave-one-out checks, placebo tests, alternative windows and factor models took its place.
Fifth, the causal story became a careful empirical statement. The final conclusion is less exciting and considerably more solid, and that is a trade worth making.
The final answer
The final answer is not “the ChatGPT launch caused a clean AI premium”. It is closer to this:
AI infrastructure stocks were historically very strong in the analysed period. They achieved higher raw returns and better risk-return metrics than comparison groups. However, much of the simple premium weakens after controlling for beta, volatility, momentum and factor exposures. The evidence for a standalone AI premium is mixed and specification-sensitive. LLM-based exposure and longer windows suggest there may still be a meaningful AI-related signal, but this is not clean causal proof.
That is the difference between producing an impressive chart and doing credible analytics.
Practical takeaway
For investment analytics, AI works well as an acceleration layer: it generates the first version of an analysis very quickly. The real value appears when the analyst brings domain knowledge, scepticism and methodological pressure to that draft. The workflow I trust most is:
AI-generated first draft
→ domain-aware critique
→ better measurement
→ statistical robustness checks
→ factor-model validation
→ more cautious interpretation
Not because it gives the most exciting conclusion, but because it gives one I can stand behind.
Why this matters beyond investing
This case is not only about AI stocks; it is a broader lesson for AI-assisted analytics. A model can help build a dashboard, write code, run tests and draft an interpretation. It cannot take responsibility off the analyst, who still has to ask:
- What assumption is hidden here?
- What would falsify this result?
- Is this a causal claim or only a descriptive pattern?
- Does the model control for the right type of risk?
- Are we selecting winners after the fact?
- Does the result survive a placebo test?
- Would another measurement method change the conclusion?
In this project, the first answer was not good enough. That was not a failure of AI. It was the normal path from a fast first draft to a result worth defending.
Disclaimer: This article is for educational purposes only. It is not investment advice or a recommendation to buy or sell any financial instrument.