dataframe: A fast, safe, and intuitive DataFrame library.

[ data, gpl, library, program ]

A fast, safe, and intuitive DataFrame library for exploratory data analysis.



Versions [RSS] 0.1.0.0, 0.1.0.1, 0.1.0.2, 0.1.0.3, 0.2.0.0, 0.2.0.1, 0.2.0.2, 0.3.0.0, 0.3.0.1
Change log CHANGELOG.md
Dependencies array (>=0.5 && <0.6), attoparsec (>=0.12 && <=0.14.4), base (>=4.17.2.0 && <4.22), bytestring (>=0.11 && <=0.12.2.0), containers (>=0.6.7 && <0.8), directory (>=1.3.0.0 && <=1.3.9.0), filepath (>=1.0.0.0 && <=1.5.4.0), hashable (>=1.2 && <=1.5.0.0), random (>=1 && <=1.3.1), statistics (>=0.16.2.1 && <0.16.3.0), template-haskell (>=2.0 && <=2.30), text (>=2.0 && <=2.1.2), time (>=1.12 && <=1.14), vector (>=0.13 && <0.14), vector-algorithms (>=0.9 && <0.10), zstd (>=0.1.2.0 && <=0.1.3.0) [details]
Tested with ghc ==9.4.8 || ==9.6.6 || ==9.8.3 || ==9.10.1 || ==9.12.1 || ==9.12.2
License GPL-3.0-or-later
Copyright (c) 2024 Michael Chavinda
Author Michael Chavinda
Maintainer mschavinda@gmail.com
Category Data
Bug tracker https://github.com/mchav/dataframe/issues
Source repo head: git clone https://github.com/mchav/dataframe
Uploaded by mchav at 2025-07-29T03:55:59Z
Distributions LTSHaskell:0.2.0.2, NixOS:0.1.0.3, Stackage:0.3.0.0
Executables dataframe, one_billion_row_challenge, california_housing, chipotle
Downloads 58 total (20 in the last 30 days)
Rating (no votes yet) [estimated by Bayesian average]
Status Docs available [build log]
Last success reported on 2025-07-29 [all 1 reports]

Readme for dataframe-0.3.0.1


dataframe logo


User guide | Discord

DataFrame

A fast, safe, and intuitive DataFrame library.

Why use this DataFrame library?

  • Encourages concise, declarative, and composable data pipelines.
  • Static typing makes code easier to reason about and catches many bugs at compile time—before your code ever runs.
  • Delivers high performance thanks to Haskell’s optimizing compiler and efficient memory model.
  • Designed for interactivity: expressive syntax, helpful error messages, and sensible defaults.
  • Works seamlessly in both command-line and notebook environments—great for exploration and scripting alike.
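To illustrate the static-typing point, here is a toy model (not the library's actual implementation; `Expr`, `col`, and `sumCol` below are simplified stand-ins for the real API) of how a typed column reference carries its element type in a phantom parameter, so the compiler rejects nonsensical aggregations before the code runs:

```haskell
-- A phantom-typed column reference: the `a` records the column's element type.
newtype Expr a = Col String

col :: String -> Expr a
col = Col

-- An aggregation that only accepts numeric columns.
sumCol :: Num a => Expr a -> Expr a
sumCol (Col name) = Col ("sum(" ++ name ++ ")")

exprName :: Expr a -> String
exprName (Col n) = n

main :: IO ()
main = do
  let quantity = col "quantity" :: Expr Int
  putStrLn (exprName (sumCol quantity))  -- prints sum(quantity)
  -- sumCol (col "item_name" :: Expr String)  -- rejected at compile time:
  -- String has no Num instance, so summing a text column is a type error.
```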

Example usage

Interactive environment

Screencast of usage in GHCi

Key features in example:

  • Intuitive, SQL-like API to get from data to insights.
  • Create typed, completion-ready references to columns in a dataframe using :exposeColumns.
  • Type-safe column transformations for faster and safer exploration.
  • Fluid, chaining API that makes code easy to reason about.
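The fluid chaining style comes from the |> operator, which is reverse function application. A minimal standalone sketch of how such chaining composes (using plain lists rather than dataframes; DataFrame exports its own (|>), and this local definition merely mirrors it):

```haskell
import Data.List (sortBy)
import Data.Ord (Down (..), comparing)

-- Reverse function application: x |> f = f x.
(|>) :: a -> (a -> b) -> b
x |> f = f x
infixl 1 |>

-- Each stage takes the data as its last argument, so stages read left to right.
pipeline :: [(String, Int)] -> [(String, Int)]
pipeline xs =
  xs
    |> filter ((> 1) . snd)
    |> sortBy (comparing (Down . snd))
    |> take 2

main :: IO ()
main = print (pipeline [("Chips", 2), ("Bowl", 5), ("Soda", 1)])
-- prints [("Bowl",5),("Chips",2)]
```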

Standalone script example

-- Useful Haskell extensions.
{-# LANGUAGE OverloadedStrings #-} -- Allows string literals to be interpreted as other string types (here, Text).
{-# LANGUAGE TypeApplications #-}  -- Convenience syntax for specifying types: `sum @Int a b` vs `sum a b :: Int`.

import qualified DataFrame as D -- import for general functionality.
import qualified DataFrame.Functions as F -- import for column expressions.

import DataFrame ((|>)) -- import the chaining operator unqualified.

main :: IO ()
main = do
    df <- D.readTsv "./data/chipotle.tsv"
    let quantity = F.col "quantity" :: D.Expr Int -- A typed reference to a column.
    print (df
      |> D.select ["item_name", "quantity"]
      |> D.groupBy ["item_name"]
      |> D.aggregate [ (F.sum quantity)     `F.as` "sum_quantity"
                     , (F.mean quantity)    `F.as` "mean_quantity"
                     , (F.maximum quantity) `F.as` "maximum_quantity"
                     ]
      |> D.sortBy D.Descending ["sum_quantity"]
      |> D.take 10)

Output:

------------------------------------------------------------------------------------------
index |          item_name           | sum_quantity |   mean_quantity    | maximum_quantity
------|------------------------------|--------------|--------------------|----------------
 Int  |             Text             |     Int      |       Double       |       Int      
------|------------------------------|--------------|--------------------|----------------
0     | Chicken Bowl                 | 761          | 1.0482093663911847 | 3              
1     | Chicken Burrito              | 591          | 1.0687160940325497 | 4              
2     | Chips and Guacamole          | 506          | 1.0563674321503131 | 4              
3     | Steak Burrito                | 386          | 1.048913043478261  | 3              
4     | Canned Soft Drink            | 351          | 1.1661129568106312 | 4              
5     | Chips                        | 230          | 1.0900473933649288 | 3              
6     | Steak Bowl                   | 221          | 1.04739336492891   | 3              
7     | Bottled Water                | 211          | 1.3024691358024691 | 10             
8     | Chips and Fresh Tomato Salsa | 130          | 1.1818181818181819 | 15             
9     | Canned Soda                  | 126          | 1.2115384615384615 | 4 

A full example using many of the constructs in the API is in the ./examples folder.
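Conceptually, the groupBy/aggregate step above computes, per item_name, the sum, mean, and maximum of quantity. A plain-Haskell sketch of that semantics using Data.Map (a model of what the pipeline computes on a tiny inline dataset, not how the library is implemented):

```haskell
import qualified Data.Map.Strict as M

-- (item_name, quantity) rows, as the select step would produce them.
rows :: [(String, Int)]
rows = [("Chicken Bowl", 1), ("Chicken Bowl", 3), ("Chips", 2)]

-- Group rows by key, then aggregate each group's quantities
-- into (sum, mean, maximum) -- mirroring the three aggregations above.
aggregate :: [(String, Int)] -> M.Map String (Int, Double, Int)
aggregate rs = M.map summarize groups
  where
    groups = M.fromListWith (++) [(k, [v]) | (k, v) <- rs]
    summarize vs =
      ( sum vs
      , fromIntegral (sum vs) / fromIntegral (length vs)
      , maximum vs
      )

main :: IO ()
main = print (aggregate rows)
```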

Installing

Jupyter notebook

CLI

  • Run the installation script: curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/mchav/dataframe/refs/heads/main/scripts/install.sh | sh
  • Download the run script with: curl --output dataframe "https://raw.githubusercontent.com/mchav/dataframe/refs/heads/main/scripts/dataframe.sh"
  • Make the script executable: chmod +x dataframe
  • Add the script's directory to your PATH: export PATH=$PATH:$(pwd)
  • Run the script with: dataframe

What is exploratory data analysis?

We provide a primer here and show how to do some common analyses.

Coming from other dataframe libraries

Familiar with another dataframe library? The user guide will help you get started.

Supported input formats

  • CSV / TSV
  • Apache Parquet (still buggy and experimental)

Future work

  • Apache Arrow compatibility
  • Integration with more common data formats (currently only CSV/TSV and experimental Parquet)
  • Support windowed plotting (currently only supports ASCII plots)
  • Host the whole library + Jupyter lab on Azure with auth and isolation.