API Reference
Auto-generated documentation from Python docstrings.
Data Generation
hellocloud.generation.WorkloadPatternGenerator
Generate realistic cloud workload patterns based on research data
generate_time_series(workload_type, start_time, end_time, interval_minutes=5)
Generate time series data for a specific workload type
hellocloud.generation.WorkloadType
Bases: Enum
Different application workload patterns based on research
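A minimal usage sketch, assuming a no-argument constructor; the WorkloadType member name and the datetime arguments are illustrative assumptions, not confirmed API:

```python
from datetime import datetime

from hellocloud.generation import WorkloadPatternGenerator, WorkloadType

generator = WorkloadPatternGenerator()  # assumes a no-argument constructor

# WorkloadType.WEB_SERVICE is a hypothetical member used for illustration;
# inspect list(WorkloadType) for the members actually defined.
series = generator.generate_time_series(
    workload_type=WorkloadType.WEB_SERVICE,
    start_time=datetime(2024, 1, 1),
    end_time=datetime(2024, 1, 2),
    interval_minutes=5,  # one sample every 5 minutes (the documented default)
)
```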
TimeSeries API
hellocloud.timeseries.TimeSeries
Wrapper around PySpark DataFrame for hierarchical time series analysis.
Attributes:

Name | Type | Description
---|---|---
df | DataFrame | PySpark DataFrame containing time series data
hierarchy | list[str] | Ordered list of key columns (coarsest to finest grain)
metric_col | str | Name of the metric/value column
time_col | str | Name of the timestamp column
__init__(df, hierarchy, metric_col, time_col)
Initialize TimeSeries wrapper.
Parameters:

Name | Type | Description | Default
---|---|---|---
df | DataFrame | PySpark DataFrame with time series data | required
hierarchy | list[str] | Ordered key columns (e.g., ["provider", "account", "region"]) | required
metric_col | str | Name of metric column (e.g., "cost") | required
time_col | str | Name of timestamp column (e.g., "date") | required

Raises:

Type | Description
---|---
TimeSeriesError | If required columns missing from DataFrame
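A minimal construction sketch based on the parameter table above; the toy DataFrame is illustrative:

```python
from pyspark.sql import SparkSession

from hellocloud.timeseries import TimeSeries

spark = SparkSession.builder.getOrCreate()

# Toy billing data; hierarchy columns run coarsest -> finest.
df = spark.createDataFrame(
    [
        ("AWS", "acc1", "us-east-1", "2024-01-01", 120.0),
        ("AWS", "acc1", "us-east-1", "2024-01-02", 95.5),
        ("GCP", "acc2", "us-central1", "2024-01-01", 40.0),
    ],
    ["provider", "account", "region", "date", "cost"],
)

ts = TimeSeries(
    df,
    hierarchy=["provider", "account", "region"],
    metric_col="cost",
    time_col="date",
)
```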
aggregate(grain)
Aggregate metric to coarser grain level.
Sums metric values across entities, grouping by grain + time.
Parameters:

Name | Type | Description | Default
---|---|---|---
grain | list[str] | Column names defining target grain (must be subset of hierarchy) | required

Returns:

Type | Description
---|---
TimeSeries | New TimeSeries aggregated to specified grain
Example
Aggregate from account+region to just account
ts.aggregate(grain=["account"])
cost_summary_by_grain(grain, sort_by='total')
Compute summary statistics for cost at specified grain.
For each entity at the grain, computes:

- total_cost: Sum across all time
- mean_cost: Average daily cost
- median_cost: Median daily cost
- std_cost: Standard deviation (volatility)
- min_cost: Minimum daily cost
- max_cost: Maximum daily cost
- days: Number of days with data

Parameters:

Name | Type | Description | Default
---|---|---|---
grain | list[str] | Dimension(s) to analyze (e.g., ['region'] or ['provider', 'region']) | required
sort_by | str | Sort by 'total', 'mean', 'volatility' (std), or 'median' | 'total'

Returns:

Type | Description
---|---
DataFrame | PySpark DataFrame with summary statistics, sorted descending
Example
Top regions by total cost with volatility stats
stats = ts.cost_summary_by_grain(['region'])
stats.show(10)
filter(**entity_keys)
Filter to specific entity by hierarchy column values.
Parameters:

Name | Type | Description | Default
---|---|---|---
**entity_keys | | Column name/value pairs to filter on (must be columns in hierarchy) | {}

Returns:

Type | Description
---|---
TimeSeries | New TimeSeries with filtered DataFrame

Raises:

Type | Description
---|---
TimeSeriesError | If filter column not in hierarchy
Example
ts.filter(provider="AWS", account="acc1")
filter_time(start=None, end=None, before=None, after=None)
Filter time series to specified time range.
Supports multiple filtering styles:

- Range filtering: start/end (inclusive start, exclusive end)
- Single-sided: before (exclusive) or after (inclusive)

Parameters:

Name | Type | Description | Default
---|---|---|---
start | str \| None | Start time (inclusive), format: 'YYYY-MM-DD' or datetime-compatible string | None
end | str \| None | End time (exclusive), format: 'YYYY-MM-DD' or datetime-compatible string | None
before | str \| None | Filter to times before this value (exclusive), alternative to end | None
after | str \| None | Filter to times after this value (inclusive), alternative to start | None

Returns:

Type | Description
---|---
TimeSeries | New TimeSeries with filtered data
Example
Filter to specific range
ts_filtered = ts.filter_time(start='2024-01-01', end='2024-12-31')
Filter before a date
ts_clean = ts.filter_time(before='2025-10-05')
Filter after a date
ts_recent = ts.filter_time(after='2024-01-01')
from_dataframe(df, hierarchy, metric_col='cost', time_col='date')
classmethod
Factory method to create TimeSeries from DataFrame.
Parameters:

Name | Type | Description | Default
---|---|---|---
df | DataFrame | PySpark DataFrame with time series data | required
hierarchy | list[str] | Ordered key columns (e.g., ["provider", "account"]) | required
metric_col | str | Name of metric column (default: "cost") | 'cost'
time_col | str | Name of timestamp column (default: "date") | 'date'

Returns:

Type | Description
---|---
TimeSeries | TimeSeries instance
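A sketch of the factory method, letting the metric and time columns fall back to their documented defaults; the toy DataFrame is illustrative:

```python
from pyspark.sql import SparkSession

from hellocloud.timeseries import TimeSeries

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("AWS", "acc1", "2024-01-01", 120.0), ("GCP", "acc2", "2024-01-01", 40.0)],
    ["provider", "account", "date", "cost"],
)

# metric_col/time_col default to "cost"/"date", so only the
# hierarchy needs to be spelled out here.
ts = TimeSeries.from_dataframe(df, hierarchy=["provider", "account"])
```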
plot_cost_distribution(grain, top_n=10, sort_by='total', min_cost=0.0, log_scale=False, group_by_parent=True, title=None, figsize=(14, 6))
Plot daily cost distribution for entities at specified grain.
Shows box plot with one box per entity, displaying:

- Median (line in box)
- 25th-75th percentiles (box)
- Whiskers (1.5 * IQR)
- Outliers (dots)

Parameters:

Name | Type | Description | Default
---|---|---|---
grain | list[str] | Dimension(s) to analyze (e.g., ['region']) | required
top_n | int | Show top N entities by sort metric | 10
sort_by | str | Sort by 'total', 'mean', 'volatility', or 'median' | 'total'
min_cost | float | Filter out daily cost values below this threshold | 0.0
log_scale | bool | Use logarithmic y-axis scale | False
group_by_parent | bool | If True, color boxes by parent hierarchy level (e.g., provider) | True
title | str \| None | Plot title (None = auto-generate) | None
figsize | tuple[int, int] | Figure size (width, height) | (14, 6)

Returns:

Type | Description
---|---
Figure | Matplotlib Figure object
Example
Top 10 regions grouped/colored by provider
ts.plot_cost_distribution(['region'], top_n=10, group_by_parent=True)
Most volatile products
ts.plot_cost_distribution(['product'], top_n=5, sort_by='volatility')
plot_cost_treemap(hierarchy, top_n=30, title=None, width=1200, height=700)
Plot hierarchical cost treemap showing cost distribution across dimensions.
Creates nested rectangular tiles sized by total cost, with proper hierarchical grouping. All children of a parent are grouped together spatially (e.g., all AWS regions grouped).
Parameters:

Name | Type | Description | Default
---|---|---|---
hierarchy | list[str] | Hierarchy levels to display (e.g., ['provider', 'region']) | required
top_n | int | Show only top N leaf entities by total cost | 30
title | str \| None | Plot title (None = auto-generate) | None
width | int | Figure width in pixels | 1200
height | int | Figure height in pixels | 700

Returns:

Type | Description
---|---
Figure | Plotly Figure object (displays automatically in Jupyter)
Example
Cost breakdown by provider and region (nested)
ts.plot_cost_treemap(['provider', 'region'], top_n=20)
Deep hierarchy with grouping
ts.plot_cost_treemap(['provider', 'region', 'product'], top_n=50)
plot_cost_trends(grain, top_n=5, sort_by='total', show_total=True, log_scale=False, title=None, figsize=(14, 6))
Plot cost trends over time for top entities at specified grain.
Shows time series with one line per entity, optionally including aggregate total.
Parameters:

Name | Type | Description | Default
---|---|---|---
grain | list[str] | Dimension(s) to analyze (e.g., ['region']) | required
top_n | int | Show top N entities by sort metric | 5
sort_by | str | Sort by 'total', 'mean', 'volatility', or 'median' | 'total'
show_total | bool | If True, add line showing total across all entities | True
log_scale | bool | Use logarithmic y-axis scale | False
title | str \| None | Plot title (None = auto-generate) | None
figsize | tuple[int, int] | Figure size (width, height) | (14, 6)

Returns:

Type | Description
---|---
Figure | Matplotlib Figure object
Example
Top 5 regions with trends and total
ts.plot_cost_trends(['region'], top_n=5, show_total=True)
Most volatile products without total
ts.plot_cost_trends(['product'], top_n=3, sort_by='volatility', show_total=False)
plot_density_by_grain(grains, log_scale=True, show_pct_change=False, title=None, figsize=(14, 5))
Plot temporal record density for multiple aggregation grains on a single figure.
For each grain, aggregates to that level and shows records per day over time. Optionally includes day-over-day percent change subplot below.
Parameters:

Name | Type | Description | Default
---|---|---|---
grains | list[str] | List of dimension names to plot (e.g., ['region', 'product', 'usage']) | required
log_scale | bool | Use logarithmic y-axis scale | True
show_pct_change | bool | If True, add day-over-day percent change subplot below | False
title | str \| None | Plot title (None = auto-generate) | None
figsize | tuple[int, int] | Figure size (width, height) | (14, 5)

Returns:

Type | Description
---|---
Figure | Matplotlib Figure object
Example
Compare temporal density across multiple aggregation grains
ts.plot_density_by_grain(['region', 'product', 'usage', 'provider'])
With percent change subplot
ts.plot_density_by_grain(['region', 'product'], show_pct_change=True)
plot_temporal_density(log_scale=False, title=None, figsize=(14, 5), **kwargs)
Plot temporal observation density at current grain.
Shows record count per timestamp to inspect observation consistency across time. Uses ConciseDateFormatter for adaptive date labeling that adjusts to the time range. Automatically generates subtitle with grain context and entity count.
Parameters:

Name | Type | Description | Default
---|---|---|---
log_scale | bool | Use logarithmic y-axis scale | False
title | str \| None | Plot title (None = auto-generate with grain context) | None
figsize | tuple[int, int] | Figure size (width, height) | (14, 5)
**kwargs | | Additional arguments passed to eda.plot_temporal_density() | {}

Returns:

Type | Description
---|---
Figure | Matplotlib Figure object
Example
ts = PiedPiperLoader.load(df)
ts.filter(account="123").plot_temporal_density(log_scale=True)
sample(grain, n=1)
Sample n random entities at specified grain level.
Parameters:

Name | Type | Description | Default
---|---|---|---
grain | list[str] | Column names defining the grain (must be subset of hierarchy) | required
n | int | Number of entities to sample | 1

Returns:

Type | Description
---|---
TimeSeries | New TimeSeries with sampled entities
Example
ts.sample(grain=["account", "region"], n=10)
summary_stats(grain=None)
Compute summary statistics for the time series.
Parameters:

Name | Type | Description | Default
---|---|---|---
grain | list[str] \| None | Optional grain to aggregate to before computing stats. If None, uses current grain of the data. | None

Returns:

Type | Description
---|---
DataFrame | PySpark DataFrame with entity keys and summary statistics (count, mean, std, min, max)
Example
stats = ts.summary_stats()  # Stats at current grain
stats = ts.summary_stats(grain=["account"])  # Aggregate first
with_df(df)
Create new TimeSeries with different DataFrame, preserving metadata.
Useful for applying transformations while keeping hierarchy/metric/time column info.
Parameters:

Name | Type | Description | Default
---|---|---|---
df | DataFrame | New DataFrame to wrap | required

Returns:

Type | Description
---|---
TimeSeries | New TimeSeries instance with same metadata
Example
Filter and create new instance
from pyspark.sql import functions as F

filtered_df = ts.df.filter(F.col('cost') > 100)
ts_filtered = ts.with_df(filtered_df)
hellocloud.io.PiedPiperLoader
Load PiedPiper billing data with EDA-informed defaults.
Applies column renames, drops low-information columns, and creates TimeSeries with standard hierarchy.
load(df, hierarchy=None, metric_col='cost', time_col='date', drop_cols=None)
staticmethod
Load PiedPiper data into TimeSeries.
Parameters:

Name | Type | Description | Default
---|---|---|---
df | DataFrame | PySpark DataFrame with PiedPiper billing data | required
hierarchy | list[str] \| None | Custom hierarchy (default: DEFAULT_HIERARCHY) | None
metric_col | str | Metric column name after rename (default: "cost") | 'cost'
time_col | str | Time column name after rename (default: "date") | 'date'
drop_cols | list[str] \| None | Columns to drop (default: DROP_COLUMNS) | None

Returns:

Type | Description
---|---
TimeSeries | TimeSeries instance with cleaned data
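A hedged end-to-end sketch; the Parquet path is a placeholder, and the raw column names are assumed to match what the loader's rename map expects:

```python
from pyspark.sql import SparkSession

from hellocloud.io import PiedPiperLoader

spark = SparkSession.builder.getOrCreate()

# Placeholder path: point this at the actual PiedPiper billing export.
raw_df = spark.read.parquet("s3://example-bucket/piedpiper/billing/")

# Applies the documented renames/drops and wraps the result in a
# TimeSeries built on DEFAULT_HIERARCHY.
ts = PiedPiperLoader.load(raw_df)

# A custom hierarchy can be supplied instead of the default.
ts_custom = PiedPiperLoader.load(raw_df, hierarchy=["provider", "account", "region"])
```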