Sampling on Datasets

Once you’ve created a dataset, you’ll want to explore the values inside it.

Exploring very large datasets can be difficult: even simple operations can be expensive, both in computational resources and in time. The main purpose of sampling is to provide immediate visual feedback while you explore and prepare the dataset, no matter how large it is. The same sampling principle applies to visualization (Charts), data preparation, and statistical analyses (Statistics).

Taking the first 10,000 rows is the fastest sampling method, but the resulting sample may be biased depending on how the dataset is ordered and composed; for example, if rows are sorted by date, the head contains only the oldest records.
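
As a minimal sketch of head sampling (assuming the dataset is loaded as a pandas DataFrame named df from a file named data.csv, neither of which appears in the original text):

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical file name

# Head sampling: fastest option, but it inherits any ordering bias
# in the file (e.g. rows sorted by date yield only the oldest records).
head_sample = df.head(10_000)
```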

Depending on your needs, there are a number of different sampling methods available, such as random, stratified, or class-rebalancing sampling.

Some common methods, each illustrated in the code sketch that follows the list, include:

  • Random Sampling: Selecting data points randomly from the dataset, giving each data point an equal chance of being chosen.

  • Stratified Sampling: Dividing the dataset into different strata or groups based on certain characteristics and then randomly sampling from each stratum. This ensures representation from each subgroup.

  • Systematic Sampling: Selecting every nth item from the dataset after an initial random start. This method is useful when there is an inherent order or structure in the data.

  • Cluster Sampling: Dividing the dataset into clusters, randomly selecting some clusters, and then sampling all data points within the selected clusters.
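
The sketch below illustrates all four methods under stated assumptions: df is a pandas DataFrame with a categorical column named "group", and the target sample size is 10,000 rows. These names and sizes are hypothetical, not taken from the original text.

```python
import random
import pandas as pd

# Hypothetical setup: df is assumed to be your dataset, loaded as a
# pandas DataFrame with a categorical column "group".
df = pd.read_csv("data.csv")
n = 10_000

# Random sampling: every row has an equal chance of being chosen.
random_sample = df.sample(n=n, random_state=42)

# Stratified sampling: sample the same fraction within each stratum,
# so every subgroup stays represented in proportion.
frac = n / len(df)
stratified_sample = (
    df.groupby("group", group_keys=False)
      .apply(lambda g: g.sample(frac=frac, random_state=42))
)

# Systematic sampling: every k-th row after a random start.
k = max(len(df) // n, 1)
start = random.randrange(k)
systematic_sample = df.iloc[start::k]

# Cluster sampling: randomly pick a few whole clusters (here, values
# of "group"), then keep every row in the chosen clusters.
clusters = df["group"].drop_duplicates().sample(n=3, random_state=42)
cluster_sample = df[df["group"].isin(clusters)]
```

As a rough design note, stratified sampling guarantees subgroup coverage at the cost of a full groupby pass, while systematic and cluster sampling trade some randomness for speed and simplicity.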

Example: Imagine you want to estimate the number of cars in a large car park spanning 50 acres. Instead of counting every car, you can count the cars in 1 acre and multiply by 50 to estimate the total. Alternatively, you could count the cars in half an acre and multiply by 100 to get the same estimate.
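
The same back-of-the-envelope estimate in code (the per-acre count below is invented for illustration):

```python
# Scale a count from a sampled area up to the whole car park.
total_acres = 50
sampled_acres = 1
cars_in_sample = 120  # hypothetical count from the sampled acre

estimate = cars_in_sample * (total_acres / sampled_acres)
print(estimate)  # 6000.0, an estimate rather than an exact count
```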