Pandas Can’t Handle This: How ArcticDB Powers Massive Datasets
Python has grown to dominate data science, and its package Pandas has become the go-to tool for data analysis. It is great for tabular data and, given enough RAM, comfortably handles data files up to about 1 GB. Within these size limits, it is also good with time-series data because it comes with some built-in support for it.
That being said, when it comes to larger datasets, Pandas alone might not be enough. And modern datasets are growing exponentially, whether they’re from finance, climate science, or other fields.
This means that, as of today, Pandas is a great tool for smaller projects or exploratory analysis. It is not great, however, when you’re facing bigger tasks or want to scale into production fast. Workarounds exist — Dask, Spark, Polars, and chunking are some of them — but they come with additional complexity and bottlenecks.
I faced this problem recently. I wanted to see whether there are correlations between weather data from the past 10 years and the stock prices of energy companies. The rationale here is that there might be sensitivities between global temperatures and the stock price evolution of fossil fuel and renewable energy companies. If one found such sensitivities, that would be a strong signal for Big Energy CEOs to start cutting their emissions in their own self-interest.
I obtained the stock price data quite easily through Yahoo! Finance's API. I used 16 stocks and ETFs (seven fossil fuel companies, six renewables companies, and three energy ETFs) and their daily closing prices over the ten years from 2013 to 2023. That resulted in about 45,000 datapoints, a piece of cake for Pandas.
Global weather data was an entirely different picture. First of all, it took me hours to download it through the Copernicus API. The API itself is amazing; the problem is simply that there is so much data. I wanted worldwide daily temperature data between 2013 and 2023. The catch is that, on a global grid of 721 points of latitude by 1,440 points of longitude, you end up downloading and later processing close to 3.8 billion datapoints.
That’s a lot of datapoints. Worth 185 GB of space on my hard drive.
To evaluate this much data, I tried chunking, but it overloaded my state-of-the-art computer. Iterating through the dataset one step at a time worked, but it meant half a day of processing every time I wanted to run a simple analysis.
The good news is that I'm quite well-connected in the financial services industry. I'd heard about ArcticDB a while back but had never given it a shot. It is a database originally developed at Man Group, a hedge fund where several of my contacts work.
So I gave ArcticDB a shot for this project — and I’m not looking back. I’m not abandoning Pandas, but for datasets in the billions I’ll choose ArcticDB over Pandas any day.
I should clarify two things at this point: First, although I know people at ArcticDB / Man Group, I’m not formally affiliated with them. I did this project independently and chose to share the results with you. Second, ArcticDB is not fully open-source. It is free for individual users within reasonable limits but has paid tiers for power users and corporations. I used the free version, which gets you pretty far—and well beyond the scope of this project actually.
With that out of the way, I’ll now show you how to set up ArcticDB and what its basic usage is. I’ll then go into my project and how I used ArcticDB in this case. You’ll also get to see some exciting results on the correlations I found between energy stocks and worldwide temperatures. I’ll follow with a performance comparison of ArcticDB and Pandas. Finally, I’ll show exactly when you’ll be better off using ArcticDB, and when you can safely use Pandas without worrying about bottlenecks.
ArcticDB For Novices
At this point, you might be wondering why I've been comparing a data manipulation tool, Pandas, with a full-blown database. The truth is that ArcticDB is a bit of both: it stores data conveniently, but it also helps you manipulate it. Its powerful perks include fast queries, versioning, and better memory management.
Installation and Setup
For Linux and Windows users, getting ArcticDB is as simple as getting any other Python package:
pip install arcticdb # or conda install -c conda-forge arcticdb
For Mac users, things are a little more complicated. ArcticDB does not support Apple chips at this time. Here are two workarounds (I’m on a Mac, and after testing I chose the first):
Run ArcticDB inside a Docker container.
Use Rosetta 2 to emulate an x86 environment.
The second workaround works, but the performance is slower. It therefore wipes out some of the gains of using ArcticDB in the first place. Nevertheless, it is a valid option if you can’t or don’t want to use Docker.
To set up ArcticDB, you need to create a local instance in the following fashion:
import arcticdb as adb
arctic = adb.Arctic("lmdb://./arcticdb")  # Local storage via LMDB
arctic.create_library("climate_finance")
ArcticDB supports multiple storage backends like AWS S3, MongoDB, and LMDB. This makes it very easy to scale into production without having to think about data engineering.
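For example, moving from the local LMDB instance above to S3 is, per the ArcticDB docs, mostly a matter of changing the connection string. Here's a sketch; the endpoint and bucket names are placeholders:
arctic_s3 = adb.Arctic("s3://s3.eu-west-2.amazonaws.com:my-bucket?aws_auth=true")  # Placeholder endpoint/bucket; aws_auth=true picks up your AWS credentials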
Basic Usage
If you know how to use Pandas, ArcticDB won't be hard for you. Here's how you'd write a Pandas dataframe into the library:
import pandas as pd
df = pd.DataFrame({"Date": ["2024-01-01", "2024-01-02"], "XOM": [100, 102]})
df["Date"] = pd.to_datetime(df["Date"])  # Ensure the Date column is in datetime format
df = df.set_index("Date")  # A sorted datetime index is the natural fit for time-series storage
climate_finance_lib = arctic["climate_finance"]
climate_finance_lib.write("energy_stock_prices", df)
To retrieve data from ArcticDB, you’d proceed in the following fashion:
df_stocks = climate_finance_lib.read("energy_stock_prices").data
print(df_stocks.head())  # Verify the stored data
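If new rows arrive later, say the next trading day, you don't need to rewrite the whole symbol: ArcticDB supports appending. A minimal sketch, continuing from the dataframe above; note that append expects the new rows' index to start after the stored data ends:
new_row = pd.DataFrame({"XOM": [104]}, index=pd.to_datetime(["2024-01-03"]))
new_row.index.name = "Date"  # Match the stored index
climate_finance_lib.append("energy_stock_prices", new_row)  # Registers a new version of the symbol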
One of the coolest features of ArcticDB is its versioning support: every write (and append) creates a new version of the symbol. If you are updating your data frequently and only want to retrieve the latest version, a plain read returns it by default:
latest_data = climate_finance_lib.read("energy_stock_prices").data  # Defaults to the latest version
And if you want a specific version, you do this:
versioned_data = climate_finance_lib.read("energy_stock_prices", as_of=-3).data
Generally speaking, the number passed to as_of follows Python's indexing conventions, much like NumPy: 0 refers to the first version, -1 to the latest, and -3 to two versions before the latest.
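To make the version arithmetic concrete, here is a small sketch that creates another version of the symbol and inspects the history; list_versions is the call I'd use to see what's stored:
df_v2 = df.copy()
df_v2["XOM"] = [101, 103]  # Pretend the original prices were corrected
climate_finance_lib.write("energy_stock_prices", df_v2)  # Creates a new version
for version, info in climate_finance_lib.list_versions("energy_stock_prices").items():
    print(version, info)  # Each version with its creation time
first_data = climate_finance_lib.read("energy_stock_prices", as_of=0).data  # The original write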
Next Steps
Once you have a grip on how to handle your data, you can analyse your dataset as you always have done. Even while using ArcticDB, reading in chunks or slices can be a good way to reduce memory usage (see the sketch below). Once you scale to production, its native integration with AWS S3 and other storage systems will be your friend.
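One convenient way to keep memory in check is to read only the slice you need: ArcticDB's read accepts a date_range argument for datetime-indexed symbols. A sketch, using the symbol from above:
from datetime import datetime
subset = climate_finance_lib.read(
    "energy_stock_prices",
    date_range=(datetime(2024, 1, 1), datetime(2024, 1, 2)),
).data  # Only rows within this window are loaded into memory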
Energy Stocks Versus Global Temperatures
Building my study around energy stocks and their potential dependence on global temperatures was fairly easy. First, I ingested the stock price and temperature data into ArcticDB. This was the script I used:
import arcticdb as adb
import pandas as pd
import xarray as xr

# Set up ArcticDB
arctic = adb.Arctic("lmdb://./arcticdb")  # Local storage
climate_finance_lib = arctic.get_library("climate_finance", create_if_missing=True)

# Load stock data and store it in ArcticDB
df_stocks = pd.read_csv("energy_stock_prices.csv", index_col=0, parse_dates=True)
climate_finance_lib.write("energy_stock_prices", df_stocks)

# Load climate data (NetCDF) and store it
ds = xr.open_dataset("climate_data.nc")
df_climate = ds.to_dataframe().reset_index()
climate_finance_lib.write("climate_temperature", df_climate)
A quick note about the data licenses: all of this data may be used commercially. The Copernicus license covers the weather data; the yfinance license covers the stock data. (The latter is a community-maintained project that makes use of Yahoo Finance data but is not officially part of Yahoo. This means that, should Yahoo at some point change its stance on yfinance, which it currently tolerates, I'll have to find another way to legally obtain this data.)
The above code does the heavy lifting around billions of datapoints within a few lines. If, like me, you’ve been battling data engineering challenges in the past, I would not be surprised if you feel a little baffled by this.
I then calculated the annual temperature anomaly. I did this by first computing the mean temperature across all grid points in the dataset. I then subtracted this from the actual temperature each day to determine the deviation from the expected norm.
This approach is unusual because one would usually calculate the daily mean temperature over 30 years of data in order to help capture unusual temperature fluctuations relative to historical trends. But since I only had 10 years of data on hand, I feared that this would muddy the results to the point where they’d be statistically laughable; hence this approach. (I’ll follow up with 30 years of data — and the help of ArcticDB — in due time!)
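In code, the anomaly computation boils down to a groupby and a subtraction. A simplified sketch, assuming df_climate (from the ingestion script above) has columns time, latitude, longitude, and a temperature column t2m, as in typical Copernicus ERA5 exports:
# Mean temperature across all grid points, per day
daily_global_mean = df_climate.groupby("time")["t2m"].mean()
# Long-run mean over the whole ten-year period
period_mean = daily_global_mean.mean()
# Anomaly: each day's deviation from the long-run mean
temperature_anomaly = daily_global_mean - period_mean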
Additionally, for the rolling correlations, I used a 30-day moving window to calculate the correlation between stock returns and my somewhat special temperature anomalies, ensuring that short-term trends and fluctuations were accounted for while smoothing out noise in the data.
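In pandas terms, the rolling correlation looks roughly like this; XOM stands in for any ticker, and temperature_anomaly is the series from the sketch above:
returns = df_stocks["XOM"].pct_change()  # Daily returns from daily closing prices
aligned = pd.concat([returns, temperature_anomaly], axis=1, join="inner").dropna()
aligned.columns = ["returns", "anomaly"]
# 30-day rolling Pearson correlation between returns and the anomaly
rolling_corr = aligned["returns"].rolling(window=30).corr(aligned["anomaly"])
avg_corr = rolling_corr.mean()  # Time-averaged correlation, used further below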
As expected and to be seen below, we get two bumps — one for summer and one for winter. (As mentioned above, one could also calculate the daily anomaly, but this usually requires at least 30 years’ worth of temperature data — better to do in production.)
Global temperature anomaly between 2013 and 2023. Image by author
I then calculated the rolling correlation between various stock tickers and the global average temperature. I did this by computing the Pearson correlation coefficient between the daily returns of each stock ticker and the corresponding daily temperature anomaly over the rolling window. This method captures how the relationship evolves over time, revealing periods of heightened or diminished correlation. A selection of this can be seen below.
On the whole, one can see that the correlation changes often. However, the peaks in the correlation are more pronounced for the featured fossil fuel companies (XOM, SHEL, EOG) and energy ETFs (XOP). There is significant correlation with temperatures for renewables companies as well (ORSTED.CO, ENPH), but it stays within narrower bounds.
Correlation of selected stocks with global temperature anomaly, 2013 to 2023. Image by author
This graph is rather busy, so I decided to take the average correlation with temperature for several stocks. Essentially this means that I used the average over time of the daily correlations. The results are rather interesting: All fossil fuel stocks have a negative correlation with the global temperature anomaly (everything from XOM to EOG below).
This means that when the anomalies increase (i.e., there is more extreme heat or cold) the fossil stock prices decrease. The effect is significant but weak, which suggests that global average temperature anomalies alone might not be the primary drivers of stock price movements. Nevertheless, it’s an interesting observation.
Most renewables stocks (from NEE to ENPH) have positive correlations with the temperature anomaly. This is somewhat expected; if temperatures get extreme, investors might start thinking more about renewable energy.
Energy ETFs (XLE, IXC, XOP) are also negatively correlated with temperature anomalies. This is not surprising because these ETFs often contain a large amount of fossil fuel companies.
Average correlation of selected stocks with temperature anomaly, 2013–2023. Image by author
All these effects are significant but small. To take this analysis to the next level, I will:
Test the regional weather impact on selected stocks. For example, cold snaps in Texas might have outsized effects on fossil fuel stocks. (Luckily, retrieving such data subsets is a charm with ArcticDB; see the sketch after this list!)
Use more weather variables: Aside from temperatures, I expect wind speeds (and therefore storms) and precipitation (droughts and flooding) to affect fossil and renewables stocks in distinct ways.
Use AI-driven models: simple correlation can say a lot, but nonlinear dependencies are better found with Bayesian networks, random forests, or deep learning techniques.
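As a taste of how those regional subsets would work: ArcticDB's QueryBuilder pushes column filters into the read itself, so only the matching rows ever reach pandas. A sketch with a rough bounding box for Texas; the coordinates are illustrative, and note that ERA5 longitudes run from 0 to 360:
from arcticdb import QueryBuilder
q = QueryBuilder()
q = q[(q["latitude"] > 25.8) & (q["latitude"] < 36.5)
      & (q["longitude"] > 253.4) & (q["longitude"] < 266.5)]  # Roughly Texas
texas_temps = climate_finance_lib.read("climate_temperature", query_builder=q).data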
These insights will be published on this blog when they're ready. Hopefully they can inspire one or two Big Energy CEOs to reshape their sustainability strategies!
ArcticDB Versus Pandas: Performance Checks
For the sake of this article, I went ahead and painstakingly re-ran my code in pure Pandas, as well as in a chunked version.
We have four operations pertaining to 10 years of stock and climate data. The table below shows how the performance compares across a basic Pandas setup, a chunked version, and the best configuration I could come up with in ArcticDB. As you can see, the ArcticDB setup is easily five times faster, if not more.
Pandas works like a charm for the small stock dataset of 45k rows, but loading the 3.8-billion-row climate dataset into a basic Pandas setup is not even possible on my machine. Chunking, too, only worked with further workarounds, essentially going one step at a time. With ArcticDB, on the other hand, this was easy.
In my setup, ArcticDB sped the whole process up by an order of magnitude. Loading the very large dataset was simply not possible without it, unless I employed major workarounds.
Source: Ari Joury / Wangari – Global. Created with Datawrapper
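For transparency, the timings were simple wall-clock measurements along these lines; a sketch, with the pure in-memory route shown for comparison (this is the route that failed outright for the full climate dataset):
import time
import xarray as xr
start = time.perf_counter()
df_adb = climate_finance_lib.read("climate_temperature").data  # ArcticDB route
print(f"ArcticDB read: {time.perf_counter() - start:.1f} s")
start = time.perf_counter()
df_mem = xr.open_dataset("climate_data.nc").to_dataframe().reset_index()  # In-memory route
print(f"xarray/Pandas load: {time.perf_counter() - start:.1f} s")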
When To Use ArcticDB
Pandas is great for relatively small, exploratory analyses. However, when performance, scalability, and quick data retrieval become mission-critical, ArcticDB can be an amazing ally. Below are some cases in which ArcticDB is worth serious consideration.
When Your Dataset is Too Large For Pandas
Pandas loads everything into RAM. Even with an excellent machine, this means that datasets beyond a few GB are bound to crash your session. ArcticDB also works with very wide datasets spanning millions of columns, something Pandas often fails at.
When You’re Working With Time-Series Data
Time-series queries are common in fields like finance, climate science, or IoT. Pandas has some native support for time-series data, but ArcticDB features faster time-based indexing and filtering. It also supports versioning, which is amazing for retrieving historical snapshots without having to reload an entire dataset. Even if you’re using Pandas for analytics, ArcticDB speeds up data retrieval, which can make your workflows much smoother.
When You Need a Production-Ready Database
Once you scale to production, Pandas won't cut it anymore; you'll need a database. Instead of agonizing over the best database to use and dealing with plenty of data engineering challenges, you can use ArcticDB because:
It easily integrates with cloud storage, notably AWS S3 and Azure.
It works as a centralized database even for large teams. In contrast, Pandas is just an in-memory tool.
It allows for parallelized reads and writes, including batched operations (see the sketch after this list).
It seamlessly complements analytical libraries like NumPy, PyTorch, and Pandas for more complex queries.
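On that last point about reads and writes, one concrete convenience is batched reads: read_batch fetches several symbols in one call. A minimal sketch, assuming both symbols from earlier exist in the library:
items = climate_finance_lib.read_batch(["energy_stock_prices", "climate_temperature"])
df_stocks, df_climate = (item.data for item in items)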
The Bottom Line: Use Cool Tools To Gain Time
Without ArcticDB, my study on weather data and energy stocks would not have been possible. At least not without major headaches around speed and memory bottlenecks.
I’ve been using and loving Pandas for years, so this is not a statement to take lightly. I still think that it’s great for smaller projects and exploratory data analysis. However, if you’re handling substantial datasets or if you want to scale your model into production, ArcticDB is your friend.
Think of ArcticDB as an ally to Pandas rather than a replacement — it bridges the gap between interactive data exploration and production-scale analytics. To me, ArcticDB is therefore a lot more than a database. It is also an advanced data manipulation tool, and it automates all the data engineering backend so that you can focus on the truly exciting stuff.
One exciting result to me is the clear difference in how fossil and renewables stocks respond to temperature anomalies. As these anomalies increase due to climate change, fossil stocks will suffer. Is that not something to tell Big Energy CEOs?
To take this further, I might focus on more localized weather and go beyond temperature. I’ll also go beyond simple correlations and use more advanced techniques to tease out nonlinear relationships in the data. (And yes, ArcticDB will likely help me with that.)
On the whole, if you're handling large or wide datasets, working with lots of time-series data, need to version your data, or want to scale quickly into production, ArcticDB is your friend. I'm looking forward to exploring this tool in more detail as my case studies progress!
Originally published at https://wangari.substack.com.