API Reference
Auto-generated documentation from Python docstrings.
Data Generation
hellocloud.generation.WorkloadPatternGenerator
Generate realistic cloud workload patterns based on research data
generate_time_series(workload_type, start_time, end_time, interval_minutes=5)
Generate time series data for a specific workload type
hellocloud.generation.WorkloadType
Bases: Enum
Different application workload patterns based on research
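A minimal usage sketch, assuming a no-argument constructor; the WorkloadType member name and the datetime arguments are illustrative assumptions, not confirmed API:

```python
from datetime import datetime

from hellocloud.generation import WorkloadPatternGenerator, WorkloadType

generator = WorkloadPatternGenerator()  # assumes a no-argument constructor

# WorkloadType.WEB_SERVICE is a hypothetical member used for illustration;
# inspect list(WorkloadType) for the members actually defined.
series = generator.generate_time_series(
    workload_type=WorkloadType.WEB_SERVICE,
    start_time=datetime(2024, 1, 1),
    end_time=datetime(2024, 1, 2),
    interval_minutes=5,  # one sample every 5 minutes (the documented default)
)
```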
TimeSeries API
hellocloud.timeseries.TimeSeries
Wrapper around PySpark DataFrame for hierarchical time series analysis.
Attributes:

Name | Type | Description
---|---|---
df | DataFrame | PySpark DataFrame containing time series data
hierarchy | list[str] | Ordered list of key columns (coarsest to finest grain)
metric_col | str | Name of the metric/value column
time_col | str | Name of the timestamp column
__init__(df, hierarchy, metric_col, time_col)
Initialize TimeSeries wrapper.
Parameters:

Name | Type | Description | Default
---|---|---|---
df | DataFrame | PySpark DataFrame with time series data | required
hierarchy | list[str] | Ordered key columns (e.g., ["provider", "account", "region"]) | required
metric_col | str | Name of metric column (e.g., "cost") | required
time_col | str | Name of timestamp column (e.g., "date") | required

Raises:

Type | Description
---|---
TimeSeriesError | If required columns missing from DataFrame
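A minimal construction sketch based on the parameter table above; the toy DataFrame is illustrative:

```python
from pyspark.sql import SparkSession

from hellocloud.timeseries import TimeSeries

spark = SparkSession.builder.getOrCreate()

# Toy billing data; hierarchy columns run coarsest -> finest.
df = spark.createDataFrame(
    [
        ("AWS", "acc1", "us-east-1", "2024-01-01", 120.0),
        ("AWS", "acc1", "us-east-1", "2024-01-02", 95.5),
        ("GCP", "acc2", "us-central1", "2024-01-01", 40.0),
    ],
    ["provider", "account", "region", "date", "cost"],
)

ts = TimeSeries(
    df,
    hierarchy=["provider", "account", "region"],
    metric_col="cost",
    time_col="date",
)
```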
aggregate(grain)
Aggregate metric to coarser grain level.
Sums metric values across entities, grouping by grain + time.
Parameters:

Name | Type | Description | Default
---|---|---|---
grain | list[str] | Column names defining target grain (must be subset of hierarchy) | required

Returns:

Type | Description
---|---
TimeSeries | New TimeSeries aggregated to specified grain
Example
Aggregate from account+region to just account
ts.aggregate(grain=["account"])
cost_summary_by_grain(grain, sort_by='total')
Compute summary statistics for cost at specified grain.
For each entity at the grain, computes:

- total_cost: Sum across all time
- mean_cost: Average daily cost
- median_cost: Median daily cost
- std_cost: Standard deviation (volatility)
- min_cost: Minimum daily cost
- max_cost: Maximum daily cost
- days: Number of days with data

Parameters:

Name | Type | Description | Default
---|---|---|---
grain | list[str] | Dimension(s) to analyze (e.g., ['region'] or ['provider', 'region']) | required
sort_by | str | Sort by 'total', 'mean', 'volatility' (std), or 'median' | 'total'

Returns:

Type | Description
---|---
DataFrame | PySpark DataFrame with summary statistics, sorted descending
Example
Top regions by total cost with volatility stats
stats = ts.cost_summary_by_grain(['region'])
stats.show(10)
filter(**entity_keys)
Filter to specific entity by hierarchy column values.
Parameters:

Name | Type | Description | Default
---|---|---|---
**entity_keys | | Column name/value pairs to filter on (must be columns in hierarchy) | {}

Returns:

Type | Description
---|---
TimeSeries | New TimeSeries with filtered DataFrame

Raises:

Type | Description
---|---
TimeSeriesError | If filter column not in hierarchy
Example
ts.filter(provider="AWS", account="acc1")
filter_time(start=None, end=None, before=None, after=None)
Filter time series to specified time range.
Supports multiple filtering styles:

- Range filtering: start/end (inclusive start, exclusive end)
- Single-sided: before (exclusive) or after (inclusive)

Parameters:

Name | Type | Description | Default
---|---|---|---
start | str \| None | Start time (inclusive), format: 'YYYY-MM-DD' or datetime-compatible string | None
end | str \| None | End time (exclusive), format: 'YYYY-MM-DD' or datetime-compatible string | None
before | str \| None | Filter to times before this value (exclusive), alternative to end | None
after | str \| None | Filter to times after this value (inclusive), alternative to start | None

Returns:

Type | Description
---|---
TimeSeries | New TimeSeries with filtered data
Example
Filter to specific range
ts_filtered = ts.filter_time(start='2024-01-01', end='2024-12-31')
Filter before a date
ts_clean = ts.filter_time(before='2025-10-05')
Filter after a date
ts_recent = ts.filter_time(after='2024-01-01')
from_dataframe(df, hierarchy, metric_col='cost', time_col='date')
classmethod
Factory method to create TimeSeries from DataFrame.
Parameters:

Name | Type | Description | Default
---|---|---|---
df | DataFrame | PySpark DataFrame with time series data | required
hierarchy | list[str] | Ordered key columns (e.g., ["provider", "account"]) | required
metric_col | str | Name of metric column (default: "cost") | 'cost'
time_col | str | Name of timestamp column (default: "date") | 'date'

Returns:

Type | Description
---|---
TimeSeries | TimeSeries instance
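A sketch of the factory method, letting the metric and time columns fall back to their documented defaults; the toy DataFrame is illustrative:

```python
from pyspark.sql import SparkSession

from hellocloud.timeseries import TimeSeries

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("AWS", "acc1", "2024-01-01", 120.0), ("GCP", "acc2", "2024-01-01", 40.0)],
    ["provider", "account", "date", "cost"],
)

# metric_col/time_col default to "cost"/"date", so only the
# hierarchy needs to be spelled out here.
ts = TimeSeries.from_dataframe(df, hierarchy=["provider", "account"])
```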
plot_cost_distribution(grain, top_n=10, sort_by='total', min_cost=0.0, log_scale=False, group_by_parent=True, title=None, figsize=(14, 6))
Plot daily cost distribution for entities at specified grain.
Shows box plot with one box per entity, displaying:

- Median (line in box)
- 25th-75th percentiles (box)
- Whiskers (1.5 * IQR)
- Outliers (dots)

Parameters:

Name | Type | Description | Default
---|---|---|---
grain | list[str] | Dimension(s) to analyze (e.g., ['region']) | required
top_n | int | Show top N entities by sort metric | 10
sort_by | str | Sort by 'total', 'mean', 'volatility', or 'median' | 'total'
min_cost | float | Filter out daily cost values below this threshold | 0.0
log_scale | bool | Use logarithmic y-axis scale | False
group_by_parent | bool | If True, color boxes by parent hierarchy level (e.g., provider) | True
title | str \| None | Plot title (None = auto-generate) | None
figsize | tuple[int, int] | Figure size (width, height) | (14, 6)

Returns:

Type | Description
---|---
Figure | Matplotlib Figure object
Example
Top 10 regions grouped/colored by provider
ts.plot_cost_distribution(['region'], top_n=10, group_by_parent=True)
Most volatile products
ts.plot_cost_distribution(['product'], top_n=5, sort_by='volatility')
plot_cost_treemap(hierarchy, top_n=30, title=None, width=1200, height=700)
Plot hierarchical cost treemap showing cost distribution across dimensions.
Creates nested rectangular tiles sized by total cost, with proper hierarchical grouping. All children of a parent are grouped together spatially (e.g., all AWS regions grouped).
Parameters:

Name | Type | Description | Default
---|---|---|---
hierarchy | list[str] | Hierarchy levels to display (e.g., ['provider', 'region']) | required
top_n | int | Show only top N leaf entities by total cost | 30
title | str \| None | Plot title (None = auto-generate) | None
width | int | Figure width in pixels | 1200
height | int | Figure height in pixels | 700

Returns:

Type | Description
---|---
Figure | Plotly Figure object (displays automatically in Jupyter)
Example
Cost breakdown by provider and region (nested)
ts.plot_cost_treemap(['provider', 'region'], top_n=20)
Deep hierarchy with grouping
ts.plot_cost_treemap(['provider', 'region', 'product'], top_n=50)
plot_cost_trends(grain, top_n=5, sort_by='total', show_total=True, log_scale=False, title=None, figsize=(14, 6))
Plot cost trends over time for top entities at specified grain.
Shows time series with one line per entity, optionally including aggregate total.
Parameters:

Name | Type | Description | Default
---|---|---|---
grain | list[str] | Dimension(s) to analyze (e.g., ['region']) | required
top_n | int | Show top N entities by sort metric | 5
sort_by | str | Sort by 'total', 'mean', 'volatility', or 'median' | 'total'
show_total | bool | If True, add line showing total across all entities | True
log_scale | bool | Use logarithmic y-axis scale | False
title | str \| None | Plot title (None = auto-generate) | None
figsize | tuple[int, int] | Figure size (width, height) | (14, 6)

Returns:

Type | Description
---|---
Figure | Matplotlib Figure object
Example
Top 5 regions with trends and total
ts.plot_cost_trends(['region'], top_n=5, show_total=True)
Most volatile products without total
ts.plot_cost_trends(['product'], top_n=3, sort_by='volatility', show_total=False)
plot_density_by_grain(grains, log_scale=True, show_pct_change=False, title=None, figsize=(14, 5))
Plot temporal record density for multiple aggregation grains on a single figure.
For each grain, aggregates to that level and shows records per day over time. Optionally includes day-over-day percent change subplot below.
Parameters:

Name | Type | Description | Default
---|---|---|---
grains | list[str] | List of dimension names to plot (e.g., ['region', 'product', 'usage']) | required
log_scale | bool | Use logarithmic y-axis scale | True
show_pct_change | bool | If True, add day-over-day percent change subplot below | False
title | str \| None | Plot title (None = auto-generate) | None
figsize | tuple[int, int] | Figure size (width, height) | (14, 5)

Returns:

Type | Description
---|---
Figure | Matplotlib Figure object
Example
Compare temporal density across multiple aggregation grains
ts.plot_density_by_grain(['region', 'product', 'usage', 'provider'])
With percent change subplot
ts.plot_density_by_grain(['region', 'product'], show_pct_change=True)
plot_temporal_density(log_scale=False, title=None, figsize=(14, 5), **kwargs)
Plot temporal observation density at current grain.
Shows record count per timestamp to inspect observation consistency across time. Uses ConciseDateFormatter for adaptive date labeling that adjusts to the time range. Automatically generates subtitle with grain context and entity count.
Parameters:

Name | Type | Description | Default
---|---|---|---
log_scale | bool | Use logarithmic y-axis scale | False
title | str \| None | Plot title (None = auto-generate with grain context) | None
figsize | tuple[int, int] | Figure size (width, height) | (14, 5)
**kwargs | | Additional arguments passed to eda.plot_temporal_density() | {}

Returns:

Type | Description
---|---
Figure | Matplotlib Figure object
Example
ts = PiedPiperLoader.load(df)
ts.filter(account="123").plot_temporal_density(log_scale=True)
sample(grain, n=1)
Sample n random entities at specified grain level.
Parameters:

Name | Type | Description | Default
---|---|---|---
grain | list[str] | Column names defining the grain (must be subset of hierarchy) | required
n | int | Number of entities to sample | 1

Returns:

Type | Description
---|---
TimeSeries | New TimeSeries with sampled entities
Example
ts.sample(grain=["account", "region"], n=10)
summary_stats(grain=None)
Compute summary statistics for the time series.
Parameters:

Name | Type | Description | Default
---|---|---|---
grain | list[str] \| None | Optional grain to aggregate to before computing stats. If None, uses current grain of the data. | None

Returns:

Type | Description
---|---
DataFrame | PySpark DataFrame with entity keys and summary statistics (count, mean, std, min, max)
Example
stats = ts.summary_stats()  # Stats at current grain
stats = ts.summary_stats(grain=["account"])  # Aggregate first
with_df(df)
Create new TimeSeries with different DataFrame, preserving metadata.
Useful for applying transformations while keeping hierarchy/metric/time column info.
Parameters:

Name | Type | Description | Default
---|---|---|---
df | DataFrame | New DataFrame to wrap | required

Returns:

Type | Description
---|---
TimeSeries | New TimeSeries instance with same metadata
Example
Filter and create new instance
from pyspark.sql import functions as F

filtered_df = ts.df.filter(F.col('cost') > 100)
ts_filtered = ts.with_df(filtered_df)
hellocloud.io.PiedPiperLoader
Load PiedPiper billing data with EDA-informed defaults.
Applies column renames, drops low-information columns, and creates TimeSeries with standard hierarchy.
load(df, hierarchy=None, metric_col='cost', time_col='date', drop_cols=None)
staticmethod
Load PiedPiper data into TimeSeries.
Parameters:

Name | Type | Description | Default
---|---|---|---
df | DataFrame | PySpark DataFrame with PiedPiper billing data | required
hierarchy | list[str] \| None | Custom hierarchy (default: DEFAULT_HIERARCHY) | None
metric_col | str | Metric column name after rename (default: "cost") | 'cost'
time_col | str | Time column name after rename (default: "date") | 'date'
drop_cols | list[str] \| None | Columns to drop (default: DROP_COLUMNS) | None

Returns:

Type | Description
---|---
TimeSeries | TimeSeries instance with cleaned data
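A hedged end-to-end sketch; the Parquet path is a placeholder, and the raw column names are assumed to match what the loader's rename map expects:

```python
from pyspark.sql import SparkSession

from hellocloud.io import PiedPiperLoader

spark = SparkSession.builder.getOrCreate()

# Placeholder path: point this at the actual PiedPiper billing export.
raw_df = spark.read.parquet("s3://example-bucket/piedpiper/billing/")

# Applies the documented renames/drops and wraps the result in a
# TimeSeries built on DEFAULT_HIERARCHY.
ts = PiedPiperLoader.load(raw_df)

# A custom hierarchy can be supplied instead of the default.
ts_custom = PiedPiperLoader.load(raw_df, hierarchy=["provider", "account", "region"])
```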