INTRODUCTION

The stock market fascinates many Americans, with independent investors and funds alike constantly scrambling to anticipate how a stock’s price will change next. Since its opening over 200 years ago, the American stock market has experienced numerous rises and crashes that have had tremendous effects on both the American economy and investors’ wealth. For over a century, overall stock trends were extremely difficult to track. However, over the course of the 20th century, market indices such as the S&P 500 were introduced, allowing the average investor to see overall market performance and adjust where they invested.

Our analysis focuses specifically on the S&P 500 index and how its performance indicates future changes in market behavior. Combining Kaggle data with a web scraper that we built, our data lets us see the adjusted close price of each stock in the index for every trading day since 2015. We also have company information, such as location, sector, and weight in the index, that gives us further context for each stock. Through our exploratory data analysis (EDA), we found some interesting correlations between location and price, as well as how sector factored into each of those variables. We used these discoveries to come up with our two modeling questions.

  1. What stocks or companies performed the best relative to expectations in 2024?
  2. When a certain sector has a change in mean stock price, how does that influence or predict the mean value of another sector’s stocks?

To answer question 1, we used time series modeling to examine which variables most heavily influenced a stock’s growth over time. This research is relevant for any investor looking to buy or sell a stock based on its projected change over a given period. For question 2, we fit models of mean sector prices to examine the interactions and correlations between sector performance. This information is relevant for any company concerned with how its business relies on other industries, or for investors looking to invest further in one sector based on the performance of another.

As with most analyses of stock prices over time, our modeling focuses on predicting growth or future prices to help investors spend their money wisely. However, unlike most ambitious data scientists looking to make a fortune overnight, we take the approach of considering interactions between industries, location effects, and long-term trend forecasting to ensure our readers truly get rich quickly.

DATA

Our data, most of which was scraped from Yahoo Finance by the Kaggle user LARXEL, comprises three data sets. The first consists of company data such as stock symbol, sector, market cap, revenue growth, number of employees, and main location as of December 20th, 2024. Since many of the numerical variables change over time, we used this data set mainly to identify categorical company information for our modeling. The second CSV contains overall S&P 500 values for every trading day dating back to late 2014. We used this data mostly in our exploratory analysis and to gauge the overall growth of the index compared to the stocks and sectors that we were modeling. Finally, the third CSV contains stock price highs, lows, and adjusted returns for each stock in the index over the last 14 years. However, we found this data to be incomplete, as it was missing values for a variety of stocks over the years that it spanned. Therefore, we set up a web scraper to pull values for each stock dating back to 2015. These were all scraped from Yahoo Finance and included each stock’s adjusted close as well as its high and low values for each day. Although we were at first uncertain about the validity of the data because of discussion comments on the Kaggle page, we verified the information for each company and used our scraper to ensure that we had correct values for each stock price. An example subset of our data is shown below:

Example Rows from Dataset

date        year  symbol  shortname                   adj_close  low    target  yesterday_adj_close  three_day_avg  five_day_avg  seven_day_avg  fourteen_day_avg  sector      fulltimeemployees  percent_change  index    index_change  diff
2016-01-04  2016  A       Agilent Technologies, Inc.  37.79      40.34  37.66   38.83                39.04333       38.972        38.81857       38.22429          Healthcare  17400              -0.0267834      2012.66  -0.0153038    -0.0114796
2016-01-05  2016  A       Agilent Technologies, Inc.  37.66      40.34  37.83   37.79                38.56000       38.724        38.76286       38.26429          Healthcare  17400              -0.0034401      2016.71  0.0020123     -0.0054523
2016-01-06  2016  A       Agilent Technologies, Inc.  37.83      40.05  36.22   37.66                38.09333       38.516        38.61571       38.29143          Healthcare  17400              0.0045141       1990.26  -0.0131154    0.0176295
2016-01-07  2016  A       Agilent Technologies, Inc.  36.22      38.81  35.84   37.83                37.76000       38.234        38.44429       38.29357          Healthcare  17400              -0.0425588      1943.09  -0.0237004    -0.0188584
2016-01-08  2016  A       Agilent Technologies, Inc.  35.84      38.47  35.24   36.22                37.23667       37.666        38.09000       38.14000          Healthcare  17400              -0.0104914      1922.03  -0.0108384    0.0003470

The company CSV, which we relied on heavily when modeling similar trends between sectors, contains 16 variables. The main ones we used were company sector, which identifies the sector the company is in, full-time employees, and date. In our modeling, we also created a few more variables to measure change in both price and index. We calculated the three-day average by taking a stock’s average adjusted close price over the previous three days. We used the same process to create each stock’s five-day, seven-day, and fourteen-day averages. We also created percent change and index change variables by subtracting the previous day’s value of the stock or the index, respectively, from the current day’s value and dividing the result by the previous day’s value. We then calculated a variable called diff by subtracting the index change from the percent change. Each of these was used in our first modeling question to predict a variable called target, which represents tomorrow’s adjusted close. The following graph shows the overall trend of the stock market since 2015, which we are trying to model.
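The derived variables described above can be sketched as follows. This is a minimal Python illustration, not the R code used in the project; the function name `build_features` and its arguments are hypothetical, and it assumes that prices and index levels are listed in date order with no gaps.

```python
from statistics import mean

# Column names follow the example rows; the mapping itself is an assumption.
DEFAULT_WINDOWS = {3: "three_day_avg", 5: "five_day_avg",
                   7: "seven_day_avg", 14: "fourteen_day_avg"}

def build_features(dates, adj_close, index_level, windows=None):
    """Return one feature dict per day that has a full history and a next day."""
    windows = windows or DEFAULT_WINDOWS
    longest = max(windows)
    rows = []
    # Start once the longest window is available; stop before the last day,
    # because the target needs tomorrow's adjusted close.
    for t in range(longest, len(adj_close) - 1):
        row = {
            "date": dates[t],
            "adj_close": adj_close[t],
            "yesterday_adj_close": adj_close[t - 1],
            "target": adj_close[t + 1],  # tomorrow's adjusted close
        }
        for w, name in windows.items():
            # Average over the previous w days, excluding today.
            row[name] = mean(adj_close[t - w:t])
        # Relative day-over-day change of the stock and of the index.
        row["percent_change"] = (adj_close[t] - adj_close[t - 1]) / adj_close[t - 1]
        row["index_change"] = (index_level[t] - index_level[t - 1]) / index_level[t - 1]
        row["diff"] = row["percent_change"] - row["index_change"]
        rows.append(row)
    return rows
```

Feeding in the Agilent prices and index levels from the example rows reproduces the tabulated three-day average, percent change, index change, and diff for 2016-01-06 up to rounding.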

RESULTS

To answer our first question concerning the best-performing stocks relative to expectations in 2024, we created models to predict a stock’s target from several factors, including company data, stock averages, and returns. Our base model simply predicts the target value, tomorrow’s stock price, from yesterday’s adjusted close price. We then added further predictors to the linear regression, such as the three-, five-, seven-, and fourteen-day averages. We also included each stock’s daily low, the variable diff, and company information such as sector and number of employees. After fitting some linear models with interactions between variables, we turned to a lasso approach for our last linear model, where we found that adjusted close, yesterday’s adjusted close, and seven-day average were the only variables with non-zero lasso coefficients. The models produced were:

Models

We tested more linear models, but we dropped about half of them because their results were redundant. These models included different combinations and interactions of the variables used in the models above. In addition, we used the xgboost library in R to apply an XGBoost model to our training data. XGBoost is a gradient-boosting method: it fits a sequence of decision trees, each one correcting the errors of the trees before it, and combines their outputs into a single prediction. The XGBoost model didn’t outperform any of the above linear models, so we chose not to include it. To compare the validity and effectiveness of these models, we split our data into training and testing sets, training each model on data from 2015 to 2023 and testing it on data from 2024. We used both mean absolute error (MAE) and root mean squared error (RMSE) to compare our models.
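The evaluation procedure described above can be sketched for the simple one-predictor model. This is a minimal Python illustration rather than the project’s R code; the function names are hypothetical, and each row is assumed to be a dict with `year`, `yesterday_adj_close`, and `target` keys.

```python
import math

def split_by_year(rows, first_test_year=2024):
    """Temporal split: rows before first_test_year train, the rest test."""
    train = [r for r in rows if r["year"] < first_test_year]
    test = [r for r in rows if r["year"] >= first_test_year]
    return train, test

def fit_simple_ols(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def evaluate_simple_model(rows, predictor="yesterday_adj_close"):
    """Fit on the training years, then score predictions on the test year."""
    train, test = split_by_year(rows)
    intercept, slope = fit_simple_ols([r[predictor] for r in train],
                                      [r["target"] for r in train])
    predicted = [intercept + slope * r[predictor] for r in test]
    actual = [r["target"] for r in test]
    return {"MAE": mae(actual, predicted), "RMSE": rmse(actual, predicted)}
```

Because RMSE squares each error before averaging, it weights large misses more heavily than MAE does, so a handful of large price moves can push RMSE well above MAE for the same model.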

Model Comparison

Model   MAE   RMSE
lm1     2.43  7.97
lm2     2.43  7.97
lm4     2.43  8.01
lm11    2.43  7.97
lm12    2.43  8.01
lm14    2.43  7.97
lm15    2.43  7.97
lm16    2.43  7.97
lm29    2.43  7.97
lm30    2.43  7.97
lm31    2.43  7.94
lasso   2.43  7.97
simple  2.44  7.98
lm3     2.44  7.97
lm7     2.44  8.06
lm17    2.44  7.97
lm28    2.45  8.03
lm6     4.05  12.18

Although we tested numerous models to predict the target price for a stock, none performed meaningfully better than our simple model, which predicts the target from yesterday’s adjusted close alone: its MAE of 2.44 is within 0.01 of the best models’ 2.43. This result calls into question the importance of the other factors, but the simple model is still reasonably effective at predicting the target, since there is a clear linear relationship between yesterday’s and today’s adjusted close values.

Using the simple model, we created a metric, Average Return Over Expected, which measures how stocks actually performed versus how they were predicted to perform. The two tables below show the best and worst companies of 2024 by this metric.

Best Performing Companies of 2024

Symbol  Name                             Sector              Average Return Over Expected
PLTR    Palantir Technologies Inc.       Technology          0.683%
VST     Vistra Corp.                     Utilities           0.581%
GEV     GE Vernova Inc.                  Utilities           0.488%
NVDA    NVIDIA Corporation               Technology          0.463%
UAL     United Airlines Holdings, Inc.   Industrials         0.385%

Worst Performing Companies of 2024

Symbol  Name                             Sector              Average Return Over Expected
WBA     Walgreens Boots Alliance, Inc.   Healthcare          -0.330%
MRNA    Moderna, Inc.                    Healthcare          -0.326%
INTC    Intel Corporation                Technology          -0.285%
DLTR    Dollar Tree, Inc.                Consumer Defensive  -0.220%
EL      Estee Lauder Companies, Inc. (T  Consumer Defensive  -0.216%
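
The exact formula for Average Return Over Expected is not spelled out, so the following Python sketch rests on an assumption: each day’s value is taken to be the relative gap between the actual and the predicted price, and the metric is the mean of those daily values per company, expressed in percent. The function names are hypothetical, and this is not the project’s R code.

```python
from collections import defaultdict

def average_return_over_expected(records):
    """records: iterable of (symbol, actual_price, predicted_price) per day.

    Assumed definition: mean over days of (actual - predicted) / predicted,
    reported in percent.
    """
    per_symbol = defaultdict(list)
    for symbol, actual, predicted in records:
        per_symbol[symbol].append((actual - predicted) / predicted)
    return {s: 100 * sum(v) / len(v) for s, v in per_symbol.items()}

def best_and_worst(scores, k=5):
    """Return the k highest and k lowest scoring symbols, most extreme first."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ordered[:k], ordered[-k:][::-1]
```

Under this reading, a stock that finishes 2% above its prediction on one day and 1% below it on the next would score 0.5%.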

At first glance, there seems to be some possible relationship between sector and Average Return Over Expected. Technology and Utilities both have two companies in the top five, while the Healthcare and Consumer Defensive sectors each have two in the bottom five. We investigated this further, grouping companies by sector and comparing the grouped Average Return Over Expected.

Despite our initial suspicions, no sector stands out as having a significantly better or worse year than expected. Every sector is extremely close to zero, and the difference between the best and worst sector is about 0.1%. The same pattern appears in the Average Return Over Expected of the individual companies. No company had an Average Return Over Expected greater than 1.0% or less than -1.0%, and the difference between the best- and worst-performing companies in 2024 was just 1.013%. We hypothesize that a year is simply too long for a company to consistently outperform or underperform expectations, as it gives the market enough time to correct itself.

For our second question, which explored the relationships between the performances of different sectors, we created a new data frame by joining the stock prices with the sector information and then reshaping the result into a time series in which each row corresponds to a date and each column corresponds to a sector. Each cell holds the mean price of that sector on that date. Structuring the data this way lets us reuse our earlier linear modeling techniques to model one mean sector price as a function of the other mean sector prices, which also means we can predict a sector’s price from the prices of other sectors. To find relationships between sectors, we ran a for loop that fit each mean sector price as Y against all other mean sector prices as X using LASSO, and then plotted the predictions. We saved the coefficients from each fit to a matrix and printed the resulting coefficient matrix once all models were fitted. A few of the predicted vs actual and residual plots were selected as samples.
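The reshaping step and the loop over sectors described above can be sketched in Python as follows. This is an illustration only, not the project’s R code: the function names are hypothetical, and the LASSO fit itself is left out, so each generated `(y, X)` pair would be handed to whatever LASSO routine is in use.

```python
from collections import defaultdict

def sector_mean_table(rows):
    """rows: iterable of (date, sector, adj_close), one per stock per day.

    Returns (dates, sectors, table), where table[i][j] is the mean price of
    sectors[j] on dates[i], or None if that sector has no data on that date.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for date, sector, price in rows:
        cell = sums[(date, sector)]
        cell[0] += price
        cell[1] += 1
    dates = sorted({d for d, _ in sums})
    sectors = sorted({s for _, s in sums})

    def cell_mean(key):
        total, count = sums.get(key, (0.0, 0))
        return total / count if count else None

    table = [[cell_mean((d, s)) for s in sectors] for d in dates]
    return dates, sectors, table

def leave_one_sector_out(sectors, table):
    """Yield (target, predictors, y, X): each sector as Y, all others as X."""
    for j, target in enumerate(sectors):
        y = [row[j] for row in table]
        X = [row[:j] + row[j + 1:] for row in table]
        yield target, sectors[:j] + sectors[j + 1:], y, X
```

Collecting the fitted coefficients from each iteration into one row per target sector would produce a coefficient matrix of the kind described above.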

At a glance, the residual plots display lots of striping. This can imply that there are subgroups within the data that we aren’t capturing correctly. To see what grouping is causing the striping, we recolored the plots by year.

Because stock prices generally rose from one calendar year to the next and we had dropped date as a predictor, the overall model could not account for time, which is intuitively one of the biggest drivers of market performance.

To avoid adding too much model complexity (the alternatives were to fit a time series model on raw prices that accounts for time, or to find a way to flatten the data across years), we decided to remove the impact of time by modeling daily returns instead of the raw prices themselves. After doing so, the scatter plot looked much better. We can even see market volatility in how widely the actual values spread along the X axis (a wider spread means a greater range of daily returns), and we can see which years were more or less volatile: 2024 appears less volatile, while 2016-2017 appear more volatile.
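The switch from raw prices to daily returns can be sketched as follows. This is a minimal Python illustration rather than the project’s R code; the function names are hypothetical, dates are assumed to be ISO-formatted strings in chronological order, and the standard deviation of daily returns is used here as one simple stand-in for the visual spread described above.

```python
import statistics

def daily_returns(prices):
    """Relative change from each day to the next; one fewer value than prices."""
    return [(today - prev) / prev for prev, today in zip(prices, prices[1:])]

def volatility_by_year(dates, prices):
    """Population standard deviation of daily returns within each calendar year.

    Assumption: a year's volatility is summarized by the spread of its returns.
    """
    by_year = {}
    # returns[i] is the change into dates[i + 1], so skip the first date.
    for date, r in zip(dates[1:], daily_returns(prices)):
        by_year.setdefault(date[:4], []).append(r)
    return {year: statistics.pstdev(rs) for year, rs in by_year.items()}
```

Returns are comparable across years in a way raw prices are not: a move from 100 to 110 and a move from 200 to 220 both count as a 10% return, so the long-run upward drift in price levels no longer separates the years into stripes.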

Just to check for trends, we tried the same procedure on the log of prices. The shape was similar overall, so we decided to stick with the non-log correlation matrix for better interpretability.

With that said, our final correlation matrix looks like this. Many of these connections are more intuitive than our previous plot.

For instance, the relationship between Basic Materials and Industrials makes sense, as does the one between Communication Services and Technology, because the sectors in each pair share some overlap in company function. Of course, data does not and should not always be in line with prior expectations. However, in this particular case, it is reasonable to assume we are closer to the truth than in the previous iteration of the model. These results have value to investors, particularly those who prefer more short-term investments. If, for example, stocks in the Utilities sector rise, investors might consider purchasing stocks in the Consumer Defensive and Real Estate sectors, as we’d expect those prices to rise as well. Different sectors of the stock market are linked with each other, as we can see a fair number of sector relationships with relatively strong correlations. Another interesting, yet unsurprising, result is that positive correlations dominate the correlation matrix. As previously mentioned, many sectors are related, and the stock market tends to rise and fall together. There is less evidence that an investor should buy stock in one sector when another struggles, as the few negative correlations are weak.
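
A correlation matrix of sector returns like the one discussed above can be sketched as follows. This Python illustration assumes Pearson correlation on daily returns, which the text does not state explicitly; the function names are hypothetical, and this is not the project’s R code.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def correlation_matrix(series):
    """series: dict mapping sector name to its list of daily returns."""
    names = sorted(series)
    matrix = [[pearson(series[a], series[b]) for b in names] for a in names]
    return names, matrix

def strongest_pairs(names, matrix, k=3):
    """The k sector pairs with the largest absolute correlation."""
    pairs = [(matrix[i][j], names[i], names[j])
             for i in range(len(names)) for j in range(i + 1, len(names))]
    return sorted(pairs, key=lambda p: abs(p[0]), reverse=True)[:k]
```

Note that a same-day correlation describes sectors moving together; it does not by itself establish that one sector’s move precedes the other’s.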

CONCLUSION

Our analysis of the S&P 500 examined stock behavior and sector interactions across nearly a decade of data. For our first modeling question, we found that while one can build many complicated models to predict stock prices, the power of yesterday’s closing price cannot be overstated. No model we tested, including models using more complex statistical and machine learning methods, was able to significantly outperform this simple model, reinforcing the idea that in markets, parsimony can sometimes beat sophistication. Similarly, in examining sectoral interactions, we were able to simply reshape our data into a time series and then apply the usual linear regression modeling process to find correlations between sector prices. Our findings suggested that multiple sectors had positive or negative relationships with each other, supporting the intuition that changes in one sector’s mean price could be correlated with the performance of other sectors. This is important for both investors and businesses because it demonstrates how industries are interconnected and how careful monitoring of one sector’s trends could offer predictive power over another’s.

The conclusions of our project are relevant to everyday stock market analysis. Analysts might be spending too much time on variables that don’t have a significant effect on a stock’s immediate future, and might be giving potential investors poor advice as a result. Our findings indicate that a reasonable prediction of a stock’s near-term performance can be made by simply looking at how the stock has performed recently and how its sector and related sectors are trending. The results of complicated models can be hard to explain to a non-specialist investor, and the improvement these models add is marginal. Our analysis could also provide value to investors seeking short-term gain by purchasing stocks of correlated sectors when one sector experiences growth in its stock prices. Day trading and other forms of short-term investing are notoriously risky, and our findings can offer assistance to these investors.

Given how important the stock market is, our analysis has several avenues for future work. Building on our findings from the first question, we know that a year is likely too long to find companies that truly differ from expectations, because the market corrects itself. One interesting way to extend this research is to define a threshold for a stock differing from expectation, say 1.0%, and find the maximum length of time that a stock tends to stay above this threshold. The dataset we used had variables such as market cap, employees, and revenue growth for each current S&P 500 company as of the end of 2024. If that data had been updated each day, as the stock prices were, we might have been able to build better models. Instead, using those variables didn’t make sense, because a company’s market cap in 2024 can’t be used to predict its stock price in 2016.
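
The threshold idea proposed above could be prototyped along these lines. This is a hypothetical Python sketch of future work, not something we ran: it assumes a per-stock series of daily return-over-expected values in percent, and the function name is illustrative.

```python
def longest_run_above(values, threshold):
    """Length of the longest consecutive run of values strictly above threshold."""
    best = current = 0
    for v in values:
        # Extend the current run, or reset it when the value drops back.
        current = current + 1 if v > threshold else 0
        best = max(best, current)
    return best
```

Applying this to every stock and examining the distribution of the resulting run lengths would indicate how long outperformance typically persists before the market corrects.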

Overall, our analysis provides several useful and actionable conclusions and is a solid baseline for future research.