API Reference

Auto-generated documentation from Python docstrings.

Data Generation

hellocloud.generation.WorkloadPatternGenerator

Generate realistic cloud workload patterns based on research data.

generate_time_series(workload_type, start_time, end_time, interval_minutes=5)

Generate time series data for a specific workload type.

hellocloud.generation.WorkloadType

Bases: Enum

Different application workload patterns based on research.
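
A minimal usage sketch (hedged: the no-argument constructor, the `WorkloadType.WEB_SERVICE` member, and the datetime argument types are assumptions for illustration, not confirmed by the docstrings):

```python
from datetime import datetime

from hellocloud.generation import WorkloadPatternGenerator, WorkloadType

generator = WorkloadPatternGenerator()  # assumed: no-argument constructor

# WorkloadType.WEB_SERVICE is illustrative; substitute any member of the WorkloadType enum.
series = generator.generate_time_series(
    workload_type=WorkloadType.WEB_SERVICE,
    start_time=datetime(2024, 1, 1),
    end_time=datetime(2024, 1, 2),
    interval_minutes=5,
)
```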

TimeSeries API

hellocloud.timeseries.TimeSeries

Wrapper around PySpark DataFrame for hierarchical time series analysis.

Attributes:

| Name | Type | Description |
|------|------|-------------|
| `df` | `DataFrame` | PySpark DataFrame containing time series data |
| `hierarchy` | `list[str]` | Ordered list of key columns (coarsest to finest grain) |
| `metric_col` | `str` | Name of the metric/value column |
| `time_col` | `str` | Name of the timestamp column |

__init__(df, hierarchy, metric_col, time_col)

Initialize TimeSeries wrapper.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `df` | `DataFrame` | PySpark DataFrame with time series data | *required* |
| `hierarchy` | `list[str]` | Ordered key columns (e.g., `["provider", "account", "region"]`) | *required* |
| `metric_col` | `str` | Name of metric column (e.g., `"cost"`) | *required* |
| `time_col` | `str` | Name of timestamp column (e.g., `"date"`) | *required* |

Raises:

| Type | Description |
|------|-------------|
| `TimeSeriesError` | If required columns are missing from the DataFrame |
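
For illustration, a minimal construction sketch (hedged: the `spark` session and the sample row are assumptions; any PySpark DataFrame containing the hierarchy, metric, and time columns works):

```python
from hellocloud.timeseries import TimeSeries

# Assumed: `spark` is an active SparkSession.
df = spark.createDataFrame(
    [("AWS", "acc1", "us-east-1", "2024-01-01", 12.5)],
    ["provider", "account", "region", "date", "cost"],
)

ts = TimeSeries(
    df,
    hierarchy=["provider", "account", "region"],
    metric_col="cost",
    time_col="date",
)
```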

aggregate(grain)

Aggregate metric to coarser grain level.

Sums metric values across entities, grouping by grain + time.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `grain` | `list[str]` | Column names defining target grain (must be subset of hierarchy) | *required* |

Returns:

| Type | Description |
|------|-------------|
| `TimeSeries` | New TimeSeries aggregated to specified grain |

Example

Aggregate from account+region to just account

ts.aggregate(grain=["account"])
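
Conceptually (a hedged sketch, not the library's actual implementation), the aggregation groups by the target grain plus the time column and sums the metric; `aggregate` then returns a new TimeSeries at that coarser grain:

```python
from pyspark.sql import functions as F

# Roughly equivalent to ts.aggregate(grain=["account"]) when
# metric_col="cost" and time_col="date".
agg_df = ts.df.groupBy("account", "date").agg(F.sum("cost").alias("cost"))
```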

cost_summary_by_grain(grain, sort_by='total')

Compute summary statistics for cost at specified grain.

For each entity at the grain, computes:

- `total_cost`: Sum across all time
- `mean_cost`: Average daily cost
- `median_cost`: Median daily cost
- `std_cost`: Standard deviation (volatility)
- `min_cost`: Minimum daily cost
- `max_cost`: Maximum daily cost
- `days`: Number of days with data

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `grain` | `list[str]` | Dimension(s) to analyze (e.g., `['region']` or `['provider', 'region']`) | *required* |
| `sort_by` | `str` | Sort by `'total'`, `'mean'`, `'volatility'` (std), or `'median'` | `'total'` |

Returns:

| Type | Description |
|------|-------------|
| | PySpark DataFrame with summary statistics, sorted descending |

Example

Top regions by total cost with volatility stats

stats = ts.cost_summary_by_grain(['region'])
stats.show(10)

filter(**entity_keys)

Filter to specific entity by hierarchy column values.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `**entity_keys` | | Column name/value pairs to filter on (must be columns in hierarchy) | `{}` |

Returns:

| Type | Description |
|------|-------------|
| `TimeSeries` | New TimeSeries with filtered DataFrame |

Raises:

| Type | Description |
|------|-------------|
| `TimeSeriesError` | If a filter column is not in the hierarchy |

Example

ts.filter(provider="AWS", account="acc1")

filter_time(start=None, end=None, before=None, after=None)

Filter time series to specified time range.

Supports multiple filtering styles:

- Range filtering: `start`/`end` (inclusive start, exclusive end)
- Single-sided: `before` (exclusive) or `after` (inclusive)

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `start` | `str \| None` | Start time (inclusive), format: `'YYYY-MM-DD'` or datetime-compatible string | `None` |
| `end` | `str \| None` | End time (exclusive), format: `'YYYY-MM-DD'` or datetime-compatible string | `None` |
| `before` | `str \| None` | Filter to times before this value (exclusive), alternative to `end` | `None` |
| `after` | `str \| None` | Filter to times after this value (inclusive), alternative to `start` | `None` |

Returns:

| Type | Description |
|------|-------------|
| `TimeSeries` | New TimeSeries with filtered data |

Example

Filter to specific range

ts_filtered = ts.filter_time(start='2024-01-01', end='2024-12-31')

Filter before a date

ts_clean = ts.filter_time(before='2025-10-05')

Filter after a date

ts_recent = ts.filter_time(after='2024-01-01')

from_dataframe(df, hierarchy, metric_col='cost', time_col='date') classmethod

Factory method to create TimeSeries from DataFrame.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `df` | `DataFrame` | PySpark DataFrame with time series data | *required* |
| `hierarchy` | `list[str]` | Ordered key columns (e.g., `["provider", "account"]`) | *required* |
| `metric_col` | `str` | Name of metric column (default: `"cost"`) | `'cost'` |
| `time_col` | `str` | Name of timestamp column (default: `"date"`) | `'date'` |

Returns:

| Type | Description |
|------|-------------|
| `TimeSeries` | TimeSeries instance |
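
A usage sketch (hedged: `billing_df` is an illustrative name for an existing PySpark DataFrame containing the listed columns):

```python
from hellocloud.timeseries import TimeSeries

# metric_col and time_col fall back to their defaults ("cost", "date").
ts = TimeSeries.from_dataframe(billing_df, hierarchy=["provider", "account"])
```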

plot_cost_distribution(grain, top_n=10, sort_by='total', min_cost=0.0, log_scale=False, group_by_parent=True, title=None, figsize=(14, 6))

Plot daily cost distribution for entities at specified grain.

Shows box plot with one box per entity, displaying:

- Median (line in box)
- 25th-75th percentiles (box)
- Whiskers (1.5 * IQR)
- Outliers (dots)

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `grain` | `list[str]` | Dimension(s) to analyze (e.g., `['region']`) | *required* |
| `top_n` | `int` | Show top N entities by sort metric | `10` |
| `sort_by` | `str` | Sort by `'total'`, `'mean'`, `'volatility'`, or `'median'` | `'total'` |
| `min_cost` | `float` | Filter out daily cost values below this threshold (default: 0.0) | `0.0` |
| `log_scale` | `bool` | Use logarithmic y-axis scale | `False` |
| `group_by_parent` | `bool` | If True, color boxes by parent hierarchy level (e.g., provider) | `True` |
| `title` | `str \| None` | Plot title (None = auto-generate) | `None` |
| `figsize` | `tuple[int, int]` | Figure size (width, height) | `(14, 6)` |

Returns:

| Type | Description |
|------|-------------|
| `Figure` | Matplotlib Figure object |

Example

Top 10 regions grouped/colored by provider

ts.plot_cost_distribution(['region'], top_n=10, group_by_parent=True)

Most volatile products

ts.plot_cost_distribution(['product'], top_n=5, sort_by='volatility')

plot_cost_treemap(hierarchy, top_n=30, title=None, width=1200, height=700)

Plot hierarchical cost treemap showing cost distribution across dimensions.

Creates nested rectangular tiles sized by total cost, with proper hierarchical grouping. All children of a parent are grouped together spatially (e.g., all AWS regions grouped).

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `hierarchy` | `list[str]` | Hierarchy levels to display (e.g., `['provider', 'region']`) | *required* |
| `top_n` | `int` | Show only top N leaf entities by total cost (default 30) | `30` |
| `title` | `str \| None` | Plot title (None = auto-generate) | `None` |
| `width` | `int` | Figure width in pixels | `1200` |
| `height` | `int` | Figure height in pixels | `700` |

Returns:

| Type | Description |
|------|-------------|
| | Plotly Figure object (displays automatically in Jupyter) |

Example

Cost breakdown by provider and region (nested)

ts.plot_cost_treemap(['provider', 'region'], top_n=20)

Deep hierarchy with grouping

ts.plot_cost_treemap(['provider', 'region', 'product'], top_n=50)

plot_cost_trends(grain, top_n=5, sort_by='total', show_total=True, log_scale=False, title=None, figsize=(14, 6))

Plot cost trends over time for top entities at specified grain.

Shows time series with one line per entity, optionally including aggregate total.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `grain` | `list[str]` | Dimension(s) to analyze (e.g., `['region']`) | *required* |
| `top_n` | `int` | Show top N entities by sort metric | `5` |
| `sort_by` | `str` | Sort by `'total'`, `'mean'`, `'volatility'`, or `'median'` | `'total'` |
| `show_total` | `bool` | If True, add line showing total across all entities | `True` |
| `log_scale` | `bool` | Use logarithmic y-axis scale | `False` |
| `title` | `str \| None` | Plot title (None = auto-generate) | `None` |
| `figsize` | `tuple[int, int]` | Figure size (width, height) | `(14, 6)` |

Returns:

| Type | Description |
|------|-------------|
| `Figure` | Matplotlib Figure object |

Example

ts.plot_cost_trends(['region'], top_n=5, show_total=True)

ts.plot_cost_trends(['product'], top_n=3, sort_by='volatility', show_total=False)

plot_density_by_grain(grains, log_scale=True, show_pct_change=False, title=None, figsize=(14, 5))

Plot temporal record density for multiple aggregation grains on a single figure.

For each grain, aggregates to that level and shows records per day over time. Optionally includes day-over-day percent change subplot below.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `grains` | `list[str]` | List of dimension names to plot (e.g., `['region', 'product', 'usage']`) | *required* |
| `log_scale` | `bool` | Use logarithmic y-axis scale (default True) | `True` |
| `show_pct_change` | `bool` | If True, add day-over-day percent change subplot below | `False` |
| `title` | `str \| None` | Plot title (None = auto-generate) | `None` |
| `figsize` | `tuple[int, int]` | Figure size (width, height) | `(14, 5)` |

Returns:

| Type | Description |
|------|-------------|
| `Figure` | Matplotlib Figure object |

Example

Compare temporal density across multiple aggregation grains

ts.plot_density_by_grain(['region', 'product', 'usage', 'provider'])

With percent change subplot

ts.plot_density_by_grain(['region', 'product'], show_pct_change=True)

plot_temporal_density(log_scale=False, title=None, figsize=(14, 5), **kwargs)

Plot temporal observation density at current grain.

Shows record count per timestamp to inspect observation consistency across time. Uses ConciseDateFormatter for adaptive date labeling that adjusts to the time range. Automatically generates subtitle with grain context and entity count.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `log_scale` | `bool` | Use logarithmic y-axis scale | `False` |
| `title` | `str \| None` | Plot title (None = auto-generate with grain context) | `None` |
| `figsize` | `tuple[int, int]` | Figure size (width, height) | `(14, 5)` |
| `**kwargs` | | Additional arguments passed to `eda.plot_temporal_density()` | `{}` |

Returns:

| Type | Description |
|------|-------------|
| `Figure` | Matplotlib Figure object |

Example

ts = PiedPiperLoader.load(df)
ts.filter(account="123").plot_temporal_density(log_scale=True)

sample(grain, n=1)

Sample n random entities at specified grain level.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `grain` | `list[str]` | Column names defining the grain (must be subset of hierarchy) | *required* |
| `n` | `int` | Number of entities to sample (default: 1) | `1` |

Returns:

| Type | Description |
|------|-------------|
| `TimeSeries` | New TimeSeries with sampled entities |

Example

ts.sample(grain=["account", "region"], n=10)

summary_stats(grain=None)

Compute summary statistics for the time series.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `grain` | `list[str] \| None` | Optional grain to aggregate to before computing stats. If None, uses current grain of the data. | `None` |

Returns:

| Type | Description |
|------|-------------|
| `DataFrame` | PySpark DataFrame with entity keys and summary statistics (count, mean, std, min, max) |

Example

stats = ts.summary_stats()  # Stats at current grain
stats = ts.summary_stats(grain=["account"])  # Aggregate first

with_df(df)

Create new TimeSeries with different DataFrame, preserving metadata.

Useful for applying transformations while keeping hierarchy/metric/time column info.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `df` | `DataFrame` | New DataFrame to wrap | *required* |

Returns:

| Type | Description |
|------|-------------|
| `TimeSeries` | New TimeSeries instance with same metadata |

Example

Filter and create new instance

filtered_df = ts.df.filter(F.col('cost') > 100)
ts_filtered = ts.with_df(filtered_df)

hellocloud.io.PiedPiperLoader

Load PiedPiper billing data with EDA-informed defaults.

Applies column renames, drops low-information columns, and creates TimeSeries with standard hierarchy.

load(df, hierarchy=None, metric_col='cost', time_col='date', drop_cols=None) staticmethod

Load PiedPiper data into TimeSeries.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `df` | `DataFrame` | PySpark DataFrame with PiedPiper billing data | *required* |
| `hierarchy` | `list[str] \| None` | Custom hierarchy (default: DEFAULT_HIERARCHY) | `None` |
| `metric_col` | `str` | Metric column name after rename (default: `"cost"`) | `'cost'` |
| `time_col` | `str` | Time column name after rename (default: `"date"`) | `'date'` |
| `drop_cols` | `list[str] \| None` | Columns to drop (default: DROP_COLUMNS) | `None` |

Returns:

| Type | Description |
|------|-------------|
| `TimeSeries` | TimeSeries instance with cleaned data |
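
A minimal end-to-end sketch (hedged: `raw_df` is an illustrative name; leaving the optional arguments as `None` applies DEFAULT_HIERARCHY and DROP_COLUMNS as documented above):

```python
from hellocloud.io import PiedPiperLoader

# raw_df: a PySpark DataFrame with PiedPiper billing data, assumed to already exist.
ts = PiedPiperLoader.load(raw_df)

# The full TimeSeries API is then available on the result, e.g.:
ts.summary_stats().show()
```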