----------------------------------------------------------------------
This is the API documentation for the sim_tools library.
----------------------------------------------------------------------


## Classes

Main classes provided by the package


Bernoulli(p: float, random_seed: Union[int, numpy.random.bit_generator.SeedSequence, NoneType] = None)

Bernoulli distribution implementation.

A discrete probability distribution that takes value 1 with probability p
and value 0 with probability 1-p.

This class conforms to the Distribution protocol and provides methods to
sample from a Bernoulli distribution with a specified probability.

Beta(alpha1: float, alpha2: float, lower_bound: float = 0.0, upper_bound: float = 1.0, random_seed: Union[int, numpy.random.bit_generator.SeedSequence, NoneType] = None)

Beta distribution implementation.

A flexible continuous probability distribution defined on the interval
[0,1], which can be rescaled to any arbitrary interval [min, max].

As defined in Simulation Modeling and Analysis (Law, 2007).

Common uses:
-----------
1. Useful as a rough model in the absence of data
2. Distribution of a random proportion
3. Time to complete a task

CombinationDistribution(*args: sim_tools.distributions.Distribution, dists=None)

Combination distribution implementation.

A distribution that combines (sums) samples from multiple underlying
distributions. Useful for modeling compound effects or building complex
distributions from simpler ones.

This class conforms to the Distribution protocol and provides methods to
sample a combination of values from multiple distributions.

DiscreteEmpirical(values: Union[numpy._typing._array_like._Buffer, numpy._typing._array_like._SupportsArray[numpy.dtype[Any]], numpy._typing._nested_sequence._NestedSequence[numpy._typing._array_like._SupportsArray[numpy.dtype[Any]]], complex, bytes, str, numpy._typing._nested_sequence._NestedSequence[complex | bytes | str]], freq: Union[numpy._typing._array_like._Buffer, numpy._typing._array_like._SupportsArray[numpy.dtype[Any]], numpy._typing._nested_sequence._NestedSequence[numpy._typing._array_like._SupportsArray[numpy.dtype[Any]]], complex, bytes, str, numpy._typing._nested_sequence._NestedSequence[complex | bytes | str]], random_seed: Union[int, numpy.random.bit_generator.SeedSequence, NoneType] = None)

DiscreteEmpirical distribution implementation.

A probability distribution that samples values with specified frequencies.
Useful for modeling categorical data or discrete outcomes with known
probabilities.

Example uses:
-------------
1. Routing percentages
2. Classes of entity
3. Batch sizes of arrivals

DistributionRegistry()

Registry for probability distribution classes with batch creation
capabilities.

The DistributionRegistry provides a central repository for registering
distribution classes and instantiating them from configuration data. This
facilitates dynamic creation of distribution objects based on runtime
configuration, supporting scenarios like simulation models, statistical
analysis, or any application requiring configurable random distributions.

Key features:
- Register distribution classes with a simple decorator
- Create distribution instances from class names and parameters
- Batch create multiple distributions from a dictionary or list
- Automatic generation of statistically independent random seeds

Examples
--------
Register distribution classes:

>>> @DistributionRegistry.register()
... class Exponential:
...     def __init__(self, mean, random_seed=None):
...         self.rng = np.random.default_rng(random_seed)
...         self.mean = mean
...
...     def sample(self, size=None):
...         return self.rng.exponential(self.mean, size=size)
...
>>> @DistributionRegistry.register("uniform_dist")  # Custom name
... class Uniform:
...     def __init__(self, low, high, random_seed=None):
...         self.rng = np.random.default_rng(random_seed)
...         self.low = low
...         self.high = high
...
...     def sample(self, size=None):
...         return self.rng.uniform(self.low, self.high, size=size)

Create a distribution with parameters:

>>> exp_dist = DistributionRegistry.create("Exponential", mean=2.0)
>>> exp_dist.sample(5)  # Generate 5 samples

Create multiple distributions from a flat configuration:

>>> config = {
...     "arrivals": {
...         "class_name": "Exponential",
...         "params": {"mean": 5.0}
...     },
...     "service_times": {
...         "class_name": "uniform_dist",
...         "params": {"low": 1.0, "high": 3.0}
...     }
... }
>>> distributions = DistributionRegistry.create_batch(config,
...                                                    main_seed=12345)
>>> arrivals = distributions["arrivals"]
>>> service_times = distributions["service_times"]

Create distributions from a nested configuration:

>>> config = {
...     "call": {
...         "C1": {"class_name": "Exponential", "params": {"mean": 10.0}},
...         "C2": {"class_name": "Exponential", "params": {"mean": 8.0}},
...     },
...     "response_time": {
...         "C1": {
...             "class_name": "Lognormal",
...             "params": {"mean": 30.0, "stdev": 5.0},
...         },
...     },
... }
>>> dists = DistributionRegistry.create_batch(
...     config, main_seed=42, preserve_structure=True
... )
>>> dists["call"]["C1"].sample(5)
>>> dists["response_time"]["C1"].sample(5)

Notes
-----
When creating distributions in batch with a main_seed, each distribution
receives its own statistically independent seed derived from the main seed.
This ensures proper statistical independence between random number streams
while maintaining overall reproducibility through the main seed.

When `preserve_structure=True`, the output mirrors the shape of the
input configuration, so nested configs produce nested dicts of instances.
When `preserve_structure=False` (default), all leaf distributions are
flattened into a single dict whose keys are the path components joined by
underscores (e.g. `"call_C1"`).

With `sort=True` (default), keys at every nesting level are sorted
alphabetically before seeds are assigned. This ensures that seed
assignment is stable even if the insertion order of keys in the config
dict changes between runs.

Erlang(mean: float, stdev: float, location: float = 0.0, random_seed: Union[int, numpy.random.bit_generator.SeedSequence, NoneType] = None)

Erlang distribution implementation.

A continuous probability distribution that is a special case of the Gamma
distribution where the shape parameter is an integer. This implementation
allows users to specify mean and standard deviation rather than shape (k)
and scale (theta) parameters.

This class conforms to the Distribution protocol and provides methods to
sample from an Erlang distribution with specified parameters.

Notes
-----
The Erlang is a special case of the gamma distribution where k is an
integer. Internally this is implemented using numpy Generator's gamma
method. The k parameter is calculated from the mean and standard deviation
and rounded to an integer.

Sources
-------
Conversion between mean+stdev to k+theta:
https://www.statisticshowto.com/erlang-distribution/

ErlangK(k: int, theta: float, location: float = 0.0, random_seed: Union[int, numpy.random.bit_generator.SeedSequence, NoneType] = None)

Erlang distribution where k and theta are specified.

The Erlang is a special case of the gamma distribution where
k is a positive integer. Internally this is implemented using
numpy Generator's gamma method.

Optionally a user can offset the origin of the distribution
using the location parameter.

Exponential(mean: float, random_seed: Union[int, numpy.random.bit_generator.SeedSequence, NoneType] = None)

Exponential distribution implementation.

A probability distribution that models the time between events in a Poisson
process, where events occur continuously and independently at a constant
average rate.

This class conforms to the Distribution protocol and provides methods to
sample from an exponential distribution with a specified mean.

FixedDistribution(value: float)

Fixed distribution implementation.

A degenerate distribution that always returns the same fixed value.
Useful for constants or deterministic parameters in models.

This class conforms to the Distribution protocol and provides methods to
sample a constant value regardless of the number of samples requested.

Gamma(alpha: float, beta: float, location: float = 0.0, random_seed: Union[int, numpy.random.bit_generator.SeedSequence, NoneType] = None)

Gamma distribution implementation with shape (alpha) and scale (beta)
parameters.

This class conforms to the Distribution protocol and provides methods to:
- Calculate theoretical mean and variance
- Derive parameters from specified mean/variance
- Generate samples from the distribution

GroupedContinuousEmpirical(lower_bounds: Union[numpy._typing._array_like._Buffer, numpy._typing._array_like._SupportsArray[numpy.dtype[Any]], numpy._typing._nested_sequence._NestedSequence[numpy._typing._array_like._SupportsArray[numpy.dtype[Any]]], complex, bytes, str, numpy._typing._nested_sequence._NestedSequence[complex | bytes | str]], upper_bounds: Union[numpy._typing._array_like._Buffer, numpy._typing._array_like._SupportsArray[numpy.dtype[Any]], numpy._typing._nested_sequence._NestedSequence[numpy._typing._array_like._SupportsArray[numpy.dtype[Any]]], complex, bytes, str, numpy._typing._nested_sequence._NestedSequence[complex | bytes | str]], freq: Union[numpy._typing._array_like._Buffer, numpy._typing._array_like._SupportsArray[numpy.dtype[Any]], numpy._typing._nested_sequence._NestedSequence[numpy._typing._array_like._SupportsArray[numpy.dtype[Any]]], complex, bytes, str, numpy._typing._nested_sequence._NestedSequence[complex | bytes | str]], random_seed: Union[int, numpy.random.bit_generator.SeedSequence, NoneType] = None)

Continuous Empirical Distribution for Grouped Data implementation.

A distribution that performs linear interpolation between upper and lower
bounds of a discrete distribution. Useful for modeling empirical data with
a continuous approximation.

This class conforms to the Distribution protocol and provides methods to
sample from a continuous empirical distribution.

Hyperexponential(probs: Union[numpy._typing._array_like._Buffer, numpy._typing._array_like._SupportsArray[numpy.dtype[Any]], numpy._typing._nested_sequence._NestedSequence[numpy._typing._array_like._SupportsArray[numpy.dtype[Any]]], complex, bytes, str, numpy._typing._nested_sequence._NestedSequence[complex | bytes | str]], rates: Union[numpy._typing._array_like._Buffer, numpy._typing._array_like._SupportsArray[numpy.dtype[Any]], numpy._typing._nested_sequence._NestedSequence[numpy._typing._array_like._SupportsArray[numpy.dtype[Any]]], complex, bytes, str, numpy._typing._nested_sequence._NestedSequence[complex | bytes | str]], random_seed: Union[int, numpy.random.bit_generator.SeedSequence, NoneType] = None)

Hyperexponential distribution implementation.

A continuous probability distribution that is a mixture (weighted sum) of
exponential distributions. It has a higher coefficient of variation than
a single exponential distribution, making it useful for modeling highly
variable processes or heavy-tailed phenomena.

The hyperexponential distribution is useful to model service processes
where customers may require fundamentally different types of service
with varying durations. For example, in a technical support call center,
customers might either:

1. Have a simple issue (resolved quickly with rate λ₁) with probability p₁

2. Have a complex issue (requiring longer service with rate λ₂) with
probability p₂

This class conforms to the Distribution protocol and provides methods to
sample from a hyperexponential distribution with specified phase
probabilities and rates.

Lognormal(mean: float, stdev: float, random_seed: Union[int, numpy.random.bit_generator.SeedSequence, NoneType] = None)

Lognormal distribution implementation.

A continuous probability distribution where the logarithm of a random
variable is normally distributed. It is useful for modeling variables that
are the product of many small independent factors.

This class conforms to the Distribution protocol and provides methods to
sample from a lognormal distribution with a specified mean and standard
deviation.

Normal(mean: float, sigma: float, minimum: Optional[float] = None, random_seed: Union[int, numpy.random.bit_generator.SeedSequence, NoneType] = None)

Normal distribution implementation with optional truncation.

A continuous probability distribution that follows the Gaussian bell curve.
This implementation allows truncating the distribution at a minimum value.

This class conforms to the Distribution protocol and provides methods to
sample from a normal distribution with specified mean and standard
deviation.

PearsonV(alpha: float, beta: float, random_seed: Union[int, numpy.random.bit_generator.SeedSequence, NoneType] = None)

Pearson Type V distribution implementation (inverse Gamma distribution).

Where alpha = shape, and beta = scale (both > 0).

Law (2007, pg 293-294) defines the distribution as
PearsonV(alpha, beta) = 1/Gamma(alpha, 1/beta) and notes that the
PDF is similar to that of lognormal, but has a larger spike
close to 0. It can be used to model the time to complete a task.

For certain values of the shape parameter the mean and variance can be
directly computed:

mean = beta / (alpha - 1) for alpha > 1.0
var = beta^2 / (alpha - 1)^2 × (alpha - 2) for alpha > 2.0

This class conforms to the Distribution protocol.

Alternative Sources:
--------------------
[1] https://riskwiki.vosesoftware.com/PearsonType5distribution.php
[2] https://modelassist.epixanalytics.com/display/EA/Pearson+Type+5

Note
----
A good R package for Pearson distributions is PearsonDS
https://www.rdocumentation.org/packages/PearsonDS/versions/1.3.0

PearsonVI(alpha1: float, alpha2: float, beta: float, random_seed: Union[int, numpy.random.bit_generator.SeedSequence, NoneType] = None)

Pearson Type VI distribution implementation (inverted beta distribution).

Where:
- alpha1 = shape parameter 1 (> 0)
- alpha2 = shape parameter 2 (> 0)
- beta = scale (> 0)

Law (2007, pg 294-295) notes that PearsonVI can be used to model
the time to complete a task.

For certain values of the shape parameters, the mean and variance can be
directly computed. See functions mean() and var() for details.

Sampling:
---------
Pearson6(a1,a2,b) = b*X/(1-X), where X=Beta(a1,a2)

This class conforms to the Distribution protocol.

Sources:
--------
[1] https://riskwiki.vosesoftware.com/PearsonType6distribution.php

Note
----
A good R package for Pearson distributions is PearsonDS
https://www.rdocumentation.org/packages/PearsonDS/versions/1.3.0

Poisson(rate: float, random_seed: Union[int, numpy.random.bit_generator.SeedSequence, NoneType] = None)

Poisson distribution implementation.

Used to simulate number of events that occur in an interval of time.
E.g. number of items in a batch.

This class conforms to the Distribution protocol.

Sources:
--------
Law (2007 pg. 308) Simulation modelling and analysis.

RawContinuousEmpirical(data: Union[numpy._typing._array_like._Buffer, numpy._typing._array_like._SupportsArray[numpy.dtype[Any]], numpy._typing._nested_sequence._NestedSequence[numpy._typing._array_like._SupportsArray[numpy.dtype[Any]]], complex, bytes, str, numpy._typing._nested_sequence._NestedSequence[complex | bytes | str]], random_seed: Union[int, numpy.random.bit_generator.SeedSequence, NoneType] = None)

Continuous Empirical Distribution for Raw Data using Law and Kelton's
method.

A distribution that performs linear interpolation between data points
according to the algorithm described in Law & Kelton's "Simulation Modeling
and Analysis". The implementation follows a two-step approach:

1. Generate U ~ Uniform(0, 1), calculate P = (n-1)U, and I = int(P) + 1
2. Return X_I + (P-I)(X_{I+1} - X_I)

This approach ensures proper weighting across intervals and is suitable for
both Monte Carlo and discrete-event simulation applications.

Maximum and minimum values of the distribution are defined by the data.

RawDiscreteEmpirical(values: Union[numpy._typing._array_like._Buffer, numpy._typing._array_like._SupportsArray[numpy.dtype[Any]], numpy._typing._nested_sequence._NestedSequence[numpy._typing._array_like._SupportsArray[numpy.dtype[Any]]], complex, bytes, str, numpy._typing._nested_sequence._NestedSequence[complex | bytes | str]], random_seed: Union[int, numpy.random.bit_generator.SeedSequence, NoneType] = None)

Raw Empirical distribution implementation.

Samples with replacement from a list of empirical values. Useful when no
theoretical distribution fits the observed data well.

This class conforms to the Distribution protocol.

Triangular(low: float, mode: float, high: float, random_seed: Union[int, numpy.random.bit_generator.SeedSequence, NoneType] = None)

Triangular distribution implementation.

A continuous probability distribution with lower limit, upper limit, and
mode, forming a triangular-shaped probability density function.

This class conforms to the Distribution protocol and provides methods to
sample from a triangular distribution with specified parameters.

TruncatedDistribution(dist_to_truncate: sim_tools.distributions.Distribution, lower_bound: float)

Truncated Distribution implementation.

Wraps any distribution conforming to the Distribution protocol and
truncates samples at a specified lower bound. No resampling is performed;
the class simply ensures no values are below the lower bound.

This class itself conforms to the Distribution protocol.

Uniform(low: float, high: float, random_seed: Union[int, numpy.random.bit_generator.SeedSequence, NoneType] = None)

Uniform distribution implementation.

A continuous probability distribution where all values in a range have
equal probability of being sampled.

This class conforms to the Distribution protocol and provides methods to
sample from a uniform distribution between specified low and high values.

Weibull(alpha: float, beta: float, location: float = 0.0, random_seed: Union[int, numpy.random.bit_generator.SeedSequence, NoneType] = None)

Weibull distribution implementation.

A continuous probability distribution useful for modeling time-to-failure
and similar phenomena. Characterized by shape (alpha) and scale (beta)
parameters.

This implementation also includes a third parameter "location" (default=0)
to shift the distribution if a lower bound is needed.

The probability density function (PDF) is:
f(x) = (α/β) * ((x-location)/β)^(α-1) * exp(-((x-location)/β)^α)
for x ≥ location, where α is the shape parameter and β is the scale
parameter.

The samples are generated using:
X = scale × (-ln(U))^(1/shape) + location
where U is a uniform random number between 0 and 1.

The Weibull distribution reduces to:
- Exponential distribution when shape=1
- Rayleigh distribution when shape=2
- Approximately Normal distribution when shape≈3.4

OnlineStatistics(data: Optional[numpy.ndarray] = None, alpha: Optional[float] = 0.1, observer: Optional[sim_tools.output_analysis.ReplicationObserver] = None) -> None

Computes running sample mean and variance using Welford's algorithm.

This is a robust and numerically stable approach first described in the
1960s and popularised in Donald Knuth's *The Art of Computer Programming*
(Vol. 2).

The term *"online"* means each new data point is processed immediately
to update statistics, without storing or reprocessing the entire dataset.

This implementation additionally supports computation of:
  - Confidence intervals (CIs).
  - Percentage deviation of CI half-widths from the mean.

Attributes
----------
n : int
    Number of data points processed so far.
x_i : float
    Most recent data point.
mean : float
    Current running mean.
_sq : float
    Sum of squared differences from the current mean (used for variance).
alpha : float
    Significance level for confidence interval calculations
observer : list
    Registered observers notified upon updates.

ReplicationTabulizer()

Observer class for recording replication results from an
`OnlineStatistics` instance during simulation runs or repeated experiments.

Implements the observer pattern to collect statistics after each update
from the observed object, enabling later tabulation and analysis. After
data collection, results can be exported as a summary dataframe (equivalent
Implement as the part of observer pattern. Provides a summary frame
to the output of `confidence_interval_method`).

Attributes
----------
stdev : list[float]
    Sequence of recorded standard deviations.
lower : list[float]
    Sequence of recorded lower confidence interval bounds.
upper : list[float]
    Sequence of recorded upper confidence interval bounds.
dev : list[float]
    Sequence of recorded percentage deviations of CI half-width from the
    mean.
cumulative_mean : list[float]
    Sequence of running mean values.
x_i : list[float]
    Sequence of last observed raw data points.
n : int
    Total number of updates recorded.

ReplicationsAlgorithm(alpha: Optional[float] = 0.05, half_width_precision: Optional[float] = 0.1, initial_replications: Optional[int] = 3, look_ahead: Optional[int] = 5, replication_budget: Optional[float] = 1000, verbose: Optional[bool] = False, observer_factory: Optional[Callable[[], sim_tools.output_analysis.AlgorithmObserver]] = None)

Automatically determine the number of simulation replications needed
to achieve and maintain a target confidence interval precision.

Implements the *Replications Algorithm* from Hoad, Robinson & Davies
(2010), which combines:
  - The **confidence interval method** to assess whether the
    target precision has been met.
  - A **sequential look-ahead procedure** to verify that
    precision remains stable in additional replications.

Attributes
----------
alpha : float
    Significance level for confidence interval calculations.
half_width_precision : float
    Target CI half-width precision (i.e. percentage deviation of the
    confidence interval from the mean).
initial_replications : int
    Number of replications to run before evaluating precision.
look_ahead : int
    Number of additional replications to simulate for stability checks
    (adjusted proportionally when `n > 100`).
replication_budget : int
    Maximum number of replications allowed.
verbose : bool
    If True, prints the current replication count during execution.
observer_factory : callable or None
    Callable returning a new observer instance for each metric. Should
    be a function, lambda, or class constructor taking no arguments.
    Returned object must follow the `AlgorithmObserver` protocol.
    If None, uses `ReplicationTabulizer`.
n : int
    Current replication count (updated during execution).
_n_solution : int
    Solution replication count once convergence is met (or replication
    budget if not met).
stats : OnlineStatistics or None
    Tracks running mean, variance, and confidence interval metrics.

References
----------
Hoad, K., Robinson, S., & Davies, R. (2010). Automated selection of the
number of replications for a discrete-event simulation. *Journal of the
Operational Research Society*, 61(11), 1632-1644.
https://www.jstor.org/stable/40926090

DistributionRegistry()

Registry for probability distribution classes with batch creation
capabilities.

The DistributionRegistry provides a central repository for registering
distribution classes and instantiating them from configuration data. This
facilitates dynamic creation of distribution objects based on runtime
configuration, supporting scenarios like simulation models, statistical
analysis, or any application requiring configurable random distributions.

Key features:
- Register distribution classes with a simple decorator
- Create distribution instances from class names and parameters
- Batch create multiple distributions from a dictionary or list
- Automatic generation of statistically independent random seeds

Examples
--------
Register distribution classes:

>>> @DistributionRegistry.register()
... class Exponential:
...     def __init__(self, mean, random_seed=None):
...         self.rng = np.random.default_rng(random_seed)
...         self.mean = mean
...
...     def sample(self, size=None):
...         return self.rng.exponential(self.mean, size=size)
...
>>> @DistributionRegistry.register("uniform_dist")  # Custom name
... class Uniform:
...     def __init__(self, low, high, random_seed=None):
...         self.rng = np.random.default_rng(random_seed)
...         self.low = low
...         self.high = high
...
...     def sample(self, size=None):
...         return self.rng.uniform(self.low, self.high, size=size)

Create a distribution with parameters:

>>> exp_dist = DistributionRegistry.create("Exponential", mean=2.0)
>>> exp_dist.sample(5)  # Generate 5 samples

Create multiple distributions from a flat configuration:

>>> config = {
...     "arrivals": {
...         "class_name": "Exponential",
...         "params": {"mean": 5.0}
...     },
...     "service_times": {
...         "class_name": "uniform_dist",
...         "params": {"low": 1.0, "high": 3.0}
...     }
... }
>>> distributions = DistributionRegistry.create_batch(config,
...                                                    main_seed=12345)
>>> arrivals = distributions["arrivals"]
>>> service_times = distributions["service_times"]

Create distributions from a nested configuration:

>>> config = {
...     "call": {
...         "C1": {"class_name": "Exponential", "params": {"mean": 10.0}},
...         "C2": {"class_name": "Exponential", "params": {"mean": 8.0}},
...     },
...     "response_time": {
...         "C1": {
...             "class_name": "Lognormal",
...             "params": {"mean": 30.0, "stdev": 5.0},
...         },
...     },
... }
>>> dists = DistributionRegistry.create_batch(
...     config, main_seed=42, preserve_structure=True
... )
>>> dists["call"]["C1"].sample(5)
>>> dists["response_time"]["C1"].sample(5)

Notes
-----
When creating distributions in batch with a main_seed, each distribution
receives its own statistically independent seed derived from the main seed.
This ensures proper statistical independence between random number streams
while maintaining overall reproducibility through the main seed.

When `preserve_structure=True`, the output mirrors the shape of the
input configuration, so nested configs produce nested dicts of instances.
When `preserve_structure=False` (default), all leaf distributions are
flattened into a single dict whose keys are the path components joined by
underscores (e.g. `"call_C1"`).

With `sort=True` (default), keys at every nesting level are sorted
alphabetically before seeds are assigned. This ensures that seed
assignment is stable even if the insertion order of keys in the config
dict changes between runs.

NSPPThinning(data: pandas.DataFrame, interval_width: Optional[float] = None, random_seed1: Union[int, numpy.random.bit_generator.SeedSequence, NoneType] = None, random_seed2: Union[int, numpy.random.bit_generator.SeedSequence, NoneType] = None)

Non Stationary Poisson Process via Thinning.

Thinning is an acceptance-rejection approach to sampling
inter-arrival times (IAT) from a time-dependent distribution
where each time period follows its own exponential distribution.

This implementation takes mean inter-arrival times as inputs, making it
consistent with NumPy's exponential distribution parameterization.


## Abstract Classes

Abstract base classes


## Protocols

Protocol / structural-typing interfaces


Distribution(*args, **kwargs)

Distribution protocol defining the interface for probability distributions.

Any class implementing this protocol should provide a sampling mechanism
that generates random values according to a specific probability
distribution.

AlgorithmObserver(*args, **kwargs)

Protocol for observer classes used in ReplicationsAlgorithm.

Classes implementing this protocol should provide a `dev` attribute
to store observations, a method `update` to add new results, and a
`summary_table` method to summarize the stored replication statistics.

Attributes
----------
dev : List[Any]
    Collection of observed replication results.

Methods
-------
update(results) -> None
    Add an observation of a replication.

summary_table() -> pd.DataFrame
    Create a DataFrame summarising all recorded replication statistics.

ReplicationObserver(*args, **kwargs)

Interface (protocol) for observers that track simulation replication
results.

ReplicationsAlgorithmModelAdapter(*args, **kwargs)

Adapter pattern for the "Replications Algorithm".

All models that use ReplicationsAlgorithm must provide a
single_run(replication_number) interface.


## Functions

Utility functions


load_banks_et_al_nspp() -> pandas.DataFrame

Load example Non-stationary poisson process data from Banks et al.

The function reads in the mean inter-arrival times by interval.
The arrival rate (1/mean_iat) is calculated.

Returns
-------
pandas.DataFrame

is_integer(value: Any, name: str) -> None

Validates that a value is an integer.

is_non_negative(value: Any, name: str) -> None

Validates that a value is greater than or equal to 0.

is_numeric(value: Any, name: str) -> None

Validates that a value is a number (int or float).

is_ordered_pair(low: Any, high: Any, low_name: str = 'low', high_name: str = 'high') -> None

Validates that two values are in ascending order (low < high).

is_ordered_triplet(low: Any, middle: Any, high: Any, low_name: str = 'low', middle_name: str = 'middle', high_name: str = 'high') -> None

Validates that three values are in ascending order.

is_positive(value: Any, name: str) -> None

Validates that a value is positive (> 0).

is_positive_array(value: numpy.ndarray, name: str) -> None

Validates that all elements in the array are positive.

is_probability(value: Any, name: str) -> None

Validates that a value is a valid probability (between 0 and 1).

is_probability_vector(value: numpy.ndarray, name: str) -> None

Validates that the array is a valid probability vector.

spawn_seeds(n_streams: int, main_seed: Optional[int] = None)

Generate multiple statistically independent random seeds.

This function creates a set of SeedSequence objects that are guaranteed
to produce independent streams of random numbers. This is crucial for
ensuring that multiple random number generators don't produce correlated
outputs, which could bias simulation results.

Parameters
----------
n_streams : int
    The number of independent seed sequences to generate.
    Must be a positive integer.

main_seed : Optional[int], default=None
    Master seed that determines all generated sequences.
    If None, a random entropy source is used, making results
    non-reproducible across runs. Providing a value enables
    reproducible sequences.

Returns
-------
List[np.random.SeedSequence]
    A list of n_streams SeedSequence objects that can be used
    to initialize random number generators with independent streams.

Notes
-----
This approach is preferred over manually creating seeds because
it uses NumPy's entropy pool management to guarantee statistical
independence between streams, avoiding subtle correlations that
might occur with manually chosen seeds.

Examples
--------
>>> seeds = spawn_seeds(3, main_seed=12345)
>>> rng1 = np.random.default_rng(seeds[0])
>>> rng2 = np.random.default_rng(seeds[1])
>>> rng3 = np.random.default_rng(seeds[2])

validate(value: Any, name: str, *validators: Callable[[Any, str], NoneType]) -> None

Applies multiple validators to a value.

confidence_interval_method(replications: Union[pandas.Series, pandas.DataFrame, Sequence[float], Sequence[Sequence[float]], Dict[str, Sequence[float]]], alpha: Optional[float] = 0.05, desired_precision: Optional[float] = 0.1, min_rep: Optional[int] = 5, decimal_places: Optional[int] = 2)

Determine the minimum number of simulation replications required to achieve
a target precision in the confidence interval of one or several performance
metrics.

This function applies the **confidence interval method**: it identifies the
smallest replication count where the relative half-width of the confidence
interval is less than the specified `desired_precision` for each metric.

Parameters
----------
replications: array-like, pd.Series, pd.DataFrame, list, or dict
    Replication results for one or more performance metrics. Accepted
    formats:
    - `pd.Series` or 1D list/numpy array → single metric
    - `pd.DataFrame` → multiple metrics in columns
    - `dict[str, list/array/Series]` → {metric_name: replications}
    - list of lists / numpy arrays / Series → multiple metrics unnamed
    Each inner sequence/Series/numpy array must contain numeric replication
    results in the order they were generated.
alpha: float, optional (default=0.05)
    Significance level for confidence interval calculations
    (CI level = 100 * (1 - alpha) %).
desired_precision: float, optional (default=0.1)
    Target CI half-width precision (i.e. percentage deviation of the
    confidence interval from the mean).
min_rep: int, optional (default=5)
    Minimum number of replications to consider before evaluating precision.
    Helps avoid unstable early results.
decimal_places: int, optional (default=2)
    Number of decimal places to round values in the returned results table.

Returns
-------
- Single-metric input → tuple `(n_reps, results_df)`
- Multi-metric input → dict:
    `{metric_name: (n_reps, results_df)}`
Where:
    n_reps : int
        The smallest number of replications achieving the desired
        precision. Returns -1 if precision is never reached.
    results_df : pandas.DataFrame
        Summary statistics at each replication:
        "Mean", "Cumulative Mean", "Standard Deviation",
        "Lower Interval", "Upper Interval", "% deviation"

Warns
-----
UserWarning
    Issued per metric if the desired precision is never reached.

plotly_confidence_interval_method(n_reps, conf_ints, metric_name, figsize=(1200, 400), shaded=True)

Create an interactive Plotly visualisation of the cumulative mean and
confidence intervals for each replication.

This plot displays:
  - The running (cumulative) mean of a performance metric.
  - Lower and upper bounds of the confidence interval at each replication.
  - Annotated deviation (as % of mean) on hover.
  - A vertical dashed line at the minimum number of replications (`n_reps`)
    required to achieve the target precision.

Parameters
----------
n_reps: int
    Minimum number of replications needed to achieve desired precision
    (typically the output of `confidence_interval_method`).
conf_ints: pandas.DataFrame
    Results DataFrame from `confidence_interval_method`, containing
    columns: `"Cumulative Mean"`, `"Lower Interval"`, `"Upper Interval"`,
    etc.
metric_name: str
    Name of the performance metric displayed in the y-axis label.
figsize: tuple, optional (default=(1200,400))
    Figure size in pixels: (width, height).
shaded: bool, optional
    If True, use shaded CI region. If False, use dashed lines (legacy).

Returns
-------
plotly.graph_objects.Figure

nspp_plot(arrival_profile: pandas.DataFrame, run_length: Optional[float] = None, n_reps: Optional[int] = 1000) -> Tuple[matplotlib.figure.Figure, matplotlib.axes._axes.Axes]

Generate a matplotlib chart to visualise a non-stationary poisson process
for the set arrival profile.

This uses the sim-tools NSPPThinning class.

Useful for validating the the NSPP has been set up correctly and is
producing the desired profile for the simulation model.

Parameters
----------
arrival_profile: pandas.DataFrame
    The arrival profile is a pandas data frame containing 't',
    'arrival_rate' and 'mean_iat' columns.

run_length: float, optional (default=None)
    How long should the simulation be run. If none then uses the last value
    in 't' + the interval (assumes equal width intervals)

n_reps: int, optional (default=1000)
    The number of replications to run.

nspp_simulation(arrival_profile: pandas.DataFrame, run_length: Optional[float] = None, n_reps: Optional[int] = 1000) -> pandas.DataFrame

Generate a pandas dataframe that contains multiple replications of
a non-stationary poisson process for the set arrival profile.

This uses the sim-tools NSPPThinning class.

Useful for validating the the NSPP has been set up correctly and is
producing the desired profile for the simulation model.

On each replication the function counts the number of arrivals during the
intervals from the arrival profile.  Returns a data frame with reps (rows)
and interval arrivals (columns).

Parameters
----------
arrival_profile: pandas.DataFrame
    The arrival profile is a pandas data frame containing 't',
    'arrival_rate' and 'mean_iat' columns.

run_length: float, optional (default=None)
    How long should the simulation be run. If none then uses the last
    value in 't' + the interval (assumes equal width intervals)

n_reps: int, optional (default=1000)
    The number of replications to run.

Returns
-------
pd.DataFrame.


## Constants

Module-level constants and data


datasets.FILE_NAME_NSPP_1

str(object='') -> str
str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to sys.getdefaultencoding().
errors defaults to 'strict'.

datasets.PATH_NSPP_1

Path subclass for non-Windows systems.

On a POSIX system, instantiating a Path should return this object.

T

Type variable.

Usage::

  T = TypeVar('T')  # Can be anything
  A = TypeVar('A', str, bytes)  # Must be str or bytes

Type variables exist primarily for the benefit of static type
checkers.  They serve as the parameters for generic types as well
as for generic function definitions.  See class Generic for more
information on generic types.  Generic functions work as follows:

  def repeat(x: T, n: int) -> List[T]:
      '''Return a list containing n references to x.'''
      return [x]*n

  def longest(x: A, y: A) -> A:
      '''Return the longest of two strings.'''
      return x if len(x) >= len(y) else y

The latter example's signature is essentially the overloading
of (str, str) -> str and (bytes, bytes) -> bytes.  Also note
that if the arguments are instances of some subclass of str,
the return type is still plain str.

At runtime, isinstance(x, T) and issubclass(C, T) will raise TypeError.

Type variables defined with covariant=True or contravariant=True
can be used to declare covariant or contravariant generic types.
See PEP 484 for more details. By default generic types are invariant
in all type variables.

Type variables can be introspected. e.g.:

  T.__name__ == 'T'
  T.__constraints__ == ()
  T.__covariant__ == False
  T.__contravariant__ = False
  A.__constraints__ == (str, bytes)

Note that only type variables defined in global scope can be pickled.

output_analysis.ALG_INTERFACE_ERROR

str(object='') -> str
str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to sys.getdefaultencoding().
errors defaults to 'strict'.

output_analysis.OBSERVER_INTERFACE_ERROR

str(object='') -> str
str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to sys.getdefaultencoding().
errors defaults to 'strict'.


## Other

Additional exports


sim_tools.ovs


----------------------------------------------------------------------
This is the User Guide documentation for the package.
----------------------------------------------------------------------

### User Guide

The User Guide is organised into four themes:

* **Sampling:** Generate reproducible random input for your models.

* **Debugging:** Lightweight tracing to help you understand and fix simulation logic.

* **Output analysis:** Tools to help you plan the number of replications.

* **Optimisation:** Optimisation via simulation procedures.


### Theoretical distributions

This notebook provides an overview of how to use the `Distribution` classes from the `sim_tools.distributions` module.

The classes are a higher level abstract of sampling from the numpy library. They encapsulate the distribution parameters, a random number generator instance, and a random seed.  For example, the code for the Exponential class is:

```{.python}
class Exponential:
    """
    Exponential distribution implementation.

    A probability distribution that models the time between events in a Poisson process,
    where events occur continuously and independently at a constant average rate.

    This class conforms to the Distribution protocol and provides methods to sample
    from an exponential distribution with a specified mean.
    """

    def __init__(
        self,
        mean: float,
        random_seed: Optional[Union[int, SeedSequence]] = None,
    ):
        """
        Initialize an exponential distribution.

        Parameters
        ----------
        mean : float
            The mean of the exponential distribution.
            Must be positive.

        random_seed : Optional[Union[int, SeedSequence]], default=None
            A random seed or SeedSequence to reproduce samples. If None, a unique
            sample sequence is generated.
        """
        self.rng = np.random.default_rng(random_seed)
        self.mean = mean

    def sample(
        self, size: Optional[Union[int, Tuple[int, ...]]] = None
    ) -> Union[float, NDArray[np.float64]]:
        """
        Generate random samples from the exponential distribution.

        Parameters
        ----------
        size : Optional[Union[int, Tuple[int, ...]]], default=None
            The number/shape of samples to generate:
            - If None: returns a single sample as a float
            - If int: returns a 1-D array with that many samples
            - If tuple of ints: returns an array with that shape

        Returns
        -------
        Union[float, NDArray[np.float64]]
            Random samples from the exponential distribution:
            - A single float when size is None
            - A numpy array of floats with shape determined by size parameter
        """
        return self.rng.exponential(self.mean, size=size)
    

```

Here we can see that the `__init__` method accepts the distribution mean plus a random seed to control sampling from a `np.random.Generator` object.  The creation of the generator is done by the super class `Distribution`

The `sample` method delegates work to the `np.random.Generator` object.

The abstraction into a class enables the distribution to be used as a simple parameter in a model.

For some classes there are no direct mappings to `numpy`.  Samples from these distributions are typically derived from other distributions. For example, `PearsonV` is an inverse gamma distribution.

```{.python}
class PearsonV:
    """
    Pearson Type V distribution implementation (inverse Gamma distribution).

    Where alpha = shape, and beta = scale (both > 0).

    Law (2007, pg 293-294) defines the distribution as
    PearsonV(alpha, beta) = 1/Gamma(alpha, 1/beta) and notes that the
    PDF is similar to that of lognormal, but has a larger spike
    close to 0. It can be used to model the time to complete a task.

    For certain values of the shape parameter the mean and variance can be
    directly computed:

    mean = beta / (alpha - 1) for alpha > 1.0
    var = beta^2 / (alpha - 1)^2 × (alpha - 2) for alpha > 2.0

    This class conforms to the Distribution protocol.

    Alternative Sources:
    --------------------
    [1] https://riskwiki.vosesoftware.com/PearsonType5distribution.php
    [2] https://modelassist.epixanalytics.com/display/EA/Pearson+Type+5

    Notes:
    ------
    A good R package for Pearson distributions is PearsonDS
    https://www.rdocumentation.org/packages/PearsonDS/versions/1.3.0
    """

    def __init__(
        self,
        alpha: float,
        beta: float,
        random_seed: Optional[Union[int, SeedSequence]] = None,
    ):
        """
        Initialize a Pearson Type V distribution.

        Parameters
        ----------
        alpha : float
            Shape parameter. Must be > 0.

        beta : float
            Scale parameter. Must be > 0.

        random_seed : Optional[Union[int, SeedSequence]], default=None
            A random seed or SeedSequence to reproduce samples. If None, a unique
            sample sequence is generated.

        Raises
        ------
        ValueError
            If alpha or beta are not positive.
        """
        if alpha <= 0 or beta <= 0:
            raise ValueError("alpha and beta must be > 0")

        self.rng = np.random.default_rng(random_seed)
        self.alpha = alpha  # shape
        self.beta = beta  # scale

    def mean(self) -> float:
        """
        Calculate the mean of the Pearson Type V distribution.

        Returns
        -------
        float
            The theoretical mean of this distribution.

        Raises
        ------
        ValueError
            If alpha <= 1.0, as the mean is not defined in this case.
        """
        if self.alpha > 1.0:
            return self.beta / (self.alpha - 1)
        msg = "Cannot directly compute mean when alpha <= 1.0"
        raise ValueError(msg)

    def var(self) -> float:
        """
        Calculate the variance of the Pearson Type V distribution.

        Returns
        -------
        float
            The theoretical variance of this distribution.

        Raises
        ------
        ValueError
            If alpha <= 2.0, as the variance is not defined in this case.
        """
        if self.alpha > 2.0:
            return (self.beta**2) / (
                ((self.alpha - 1) ** 2) * (self.alpha - 2)
            )
        msg = "Cannot directly compute var when alpha <= 2.0"
        raise ValueError(msg)

    def sample(
        self, size: Optional[Union[int, Tuple[int, ...]]] = None
    ) -> Union[float, NDArray[np.float64]]:
        """
        Generate random samples from the Pearson Type V distribution.

        Parameters
        ----------
        size : Optional[Union[int, Tuple[int, ...]]], default=None
            The number/shape of samples to generate:
            - If None: returns a single sample as a float
            - If int: returns a 1-D array with that many samples
            - If tuple of ints: returns an array with that shape

        Returns
        -------
        Union[float, NDArray[np.float64]]
            Random samples from the Pearson Type V distribution:
            - A single float when size is None
            - A numpy array of floats with shape determined by size parameter
        """
        return 1 / self.rng.gamma(self.alpha, 1 / self.beta, size)
```

## Summary of implemented distributions

1. **Exponential**: This class is used for modeling the time between events in a Poisson process, where events occur continuously and independently at a constant average rate. It takes the mean of the distribution and an optional random seed as parameters.

2. **Bernoulli**: This class models a process where an event has only two possible outcomes (success or failure). It takes the probability of success and an optional random seed as parameters.

3. **Lognormal**: This class is used when the logarithm of a variable is normally distributed. It takes the mean and standard deviation of the lognormal distribution and an optional random seed as parameters.

4. **Normal**: This class models a process where a variable is normally distributed. It takes the mean and standard deviation of the normal distribution, a boolean to allow or disallow negative samples, and an optional random seed as parameters.

5. **Uniform**: This class models a process where all outcomes are equally likely. It takes the lower and upper range of the uniform distribution and an optional random seed as parameters.

6. **Triangular**: This class models a process where a variable is most likely to be at the mode (peak) and the probabilities decrease uniformly on either side. It takes the lower limit, mode, upper limit, and an optional random seed as parameters.

7. **FixedDistribution**: This class returns a fixed value, useful when a constant value is needed in a simulation. It takes a fixed value as a parameter.

8. **CombinationDistribution**: This class is used to model a process where the outcome is the sum of outcomes from multiple distributions. It takes multiple distribution instances as parameters.

9. **GroupedContinuousEmpirical**: This class is used to model a process based on observed data, where the probability of an outcome is proportional to its frequency in the data. To produce a continuous value, linear interpolation between bounds is used. It takes lower bounds (of a discrete empirical distribution), upper bounds, observed frequencies, and an optional random seed as parameters.

10. **Erlang**: This class is used for modeling the time between events in a Poisson process, where events occur continuously and independently at a constant average rate. The class is implemented to allow for users to input the mean, and stdev of the distribution as opposed to k and theta. This allows for compatibility of parameters with other commerical simulation packages. The mean and stdev are converted to k and theta internally. The Erlang is a special case of the gamma distribution where  is an integer.  Internally this is implemented using numpy Generator's gamma method. As k is calculated from the mean and stdev it is rounded to an integer value using python's built in 'round' function. Optionally a user can offet the original of the distribution using the location parameter.

11. **ErlangK**: The same as **Erlang**, but `k` and `theta` are specified directly by a user. 

12. The **Weibull** distribution class is characterized by a shape parameter (alpha) and a scale parameter (beta), both of which must be greater than zero. The scale parameter influences the variance of the samples. An optional location parameter can shift the distribution to set a lower bound. This distribution is useful for modeling the life duration of materials and systems, where the failure rate is not constant over time[1].

13. The **Gamma** distribution class is defined by an alpha (scale) and beta (shape) parameter. It includes methods to compute the mean and variance, as well as a static method to determine alpha and beta from a specified mean and variance. The Gamma distribution is often used to model waiting times for multiple events in a Poisson process.

14. The **Beta** distribution class accepts two shape parameters (alpha1 and alpha2) and can be rescaled to a specific range using lower and upper bounds. It is commonly used to model random variables that are constrained to an interval, such as proportions and percentages.

15. **DiscreteEmpirical** distribution class allows sampling from a set of values with given observed frequencies. It is particularly useful for modeling scenarios where outcomes have specific probabilities, such as routing percentages or entity classes.

16. **TruncatedDistribution** class can truncate any given distribution at a specified lower bound. It is useful when modeling variables that have a natural lower limit, ensuring that no sampled values fall below this threshold.

17. The **RawDiscreteEmpirical** distribution class enables sampling with replacement from a list of raw empirical values. This class is beneficial when no theoretical distribution fits the observed data well, and a non-parametric approach is preferred.

18. The **PearsonV** (Pearson type 5) distribution, also known as the inverse Gamma distribution, is characterized by two parameters: alpha (shape) and beta (scale), both of which must be greater than zero. It is particularly useful for modeling time delays where there is a minimum delay and the maximum delay is unbounded. The mean and variance of the Pearson Type V distribution can be directly computed for certain values of the shape parameter.

19. The **PearsonVI** (Pearson Type 6) distribution, an inverted beta distribution, is defined by three parameters: alpha1 (shape parameter 1), alpha2 (shape parameter 2), and beta (scale parameter), all of which must be greater than zero. The Pearson Type VI distribution can be used to model the time to complete tasks and is sampled using the transformation Pearson6(a1, a2, b) = b * X / (1 - X), where X is a Beta-distributed random variable with parameters alpha1 and alpha2. The mean and variance can be computed directly for certain values of the second shape parameter.

20. The **Hyperexoponential** distribution.  A continuous probability distribution that is a mixture (weighted sum) of exponential distributions. It has a higher coefficient of variation than a single exponential distribution, making it useful for modeling highly variable processes or heavy-tailed phenomena.

## Example output

The code below creates an instance of the distributions on offer. Here we used the `size` parameter in `sample()` to produce a large number of samples to allow for visualisation.

```{python}
# pylint: disable=missing-module-docstring,invalid-name

# Import package to generate plots
import matplotlib.pyplot as plt

# Import a selection of distributions
from sim_tools.distributions import (
    Exponential,
    Poisson,
    Bernoulli,
    Lognormal,
    Normal,
    Uniform,
    Triangular,
    FixedDistribution,
    CombinationDistribution,
    GroupedContinuousEmpirical,
    Erlang,
    ErlangK,
    Weibull,
    Gamma,
    Beta,
    DiscreteEmpirical,
    PearsonV,
    PearsonVI,
    Hyperexponential,
)
```

### Create plots

```{python}
# Create an instance of each distribution and generate 10,000 samples
distributions = [
    Exponential(mean=1.0, random_seed=42),
    Poisson(rate=1.0, random_seed=42),
    Erlang(mean=32.0, stdev=9.26, location=2.83, random_seed=42),
    ErlangK(k=3, theta=1.0, location=0.0, random_seed=42),
    Bernoulli(p=0.3, random_seed=42),
    Lognormal(mean=10, stdev=1, random_seed=42),
    Normal(mean=0, sigma=1, minimum=None, random_seed=42),
    Uniform(low=0, high=1, random_seed=42),
    Triangular(low=0, mode=0.5, high=1, random_seed=42),
    FixedDistribution(value=5),
    CombinationDistribution(Exponential(mean=1.0), Normal(mean=0, sigma=1)),
    Hyperexponential([0.5, 0.5], [1.0, 0.2], random_seed=42),
    Weibull(alpha=1.93, beta=19.15, random_seed=42),
    Gamma(alpha=2.84, beta=7.42, location=0.0, random_seed=42),
    Beta(alpha1=1.32, alpha2=2.56, lower_bound=10.0, upper_bound=15.0, random_seed=42),
    DiscreteEmpirical(values=[1, 2, 3], freq=[95, 3, 2], random_seed=42),
    PearsonV(alpha=10.0, beta=2.5, random_seed=42),
    PearsonVI(alpha1=10.0, alpha2=5.0, beta=0.85, random_seed=42),
]

fig, axs = plt.subplots(len(distributions) // 2, 2, figsize=(15, 15))

for i, distribution in enumerate(distributions):
    row = i // 2
    col = i % 2
    samples = distribution.sample(size=100_000)
    axs[row, col].hist(samples, bins=30, alpha=0.7)
    axs[row, col].set_title(distribution.__class__.__name__)

fig.tight_layout()
plt.show()
```

## Single samples

```{python}
arrival_dist = Exponential(mean=30, random_seed=42)
arrival_dist.sample()
```

```{python}
activity_time = GroupedContinuousEmpirical(
    lower_bounds=[0, 5, 10, 15, 30, 45, 60, 120, 180, 240, 480],
    upper_bounds=[5, 10, 15, 30, 45, 60, 120, 180, 240, 480, 2880],
    freq=[34, 4, 8, 13, 15, 13, 19, 13, 9, 12, 73],
    random_seed=42,
)

activity_time.sample()
```


## Overview

This notebook demonstrates how to use the `DistributionRegistry` class for managing probability distributions in `sim-tools`.

The `DistributionRegistry` aims to support simulation modellers set their models up to easily be parameterised with statistical distributions.  It offers the following features:

- **Configuration flexibility**: Define distributions through simple dictionaries or JSON files
- **Reproducibility**: Generate statistically independent random streams with controlled seeds
- **Extensibility**: Easily add new distribution types without changing existing code
- **Batch creation**: Create multiple related distributions with a single command

The approach is useful for building discrete event simulations, Monte Carlo analyses, statistical models, or any system that requires configurable random behavior. The registry pattern provides a clean, maintainable approach that separates configuration from implementation.

In this notebook, we'll cover:

1. Importing the `DistributionRegistry`.
2. Creating distributions from configuration data.
3. Working with JSON configurations.
4. Define and add your own distribution to the registry.
5. Using the `sort` argument in `create_batch`.

## 1. Importing

The `DistributionRegistry` is part of the `distributions` module.  You can think of the registry as a factory that takes distribution orders and delivers ready to use distribution objects.  We do not need to create an instance of the `DistributionRegistry`. Instead we will use its class methods.


```{python}
# pylint: disable=missing-module-docstring,invalid-name
import json

import numpy as np
from sim_tools.distributions import DistributionRegistry
```

## 2. Creating batches of distributions from configuration data

`DistributionRegistry` allows a user to create a batch of statistical distributions in one go using `create_batch`.  A user can either pass in a `dict` or a `list`.

> `DistributionRegistry` sets up non-overlapping pseudo random number streams for the distributions using the `distributions.spawn_seeds()` function. For reproducibile results a user should se the `main_seed` parameter of `create_batch`.

```{python}
# Dictionary-based configuration (named distributions)
config_dict = {
    "customer_arrivals": {"class_name": "Exponential", "params": {"mean": 4.5}},
    "service_times": {"class_name": "Normal", "params": {"mean": 10.0, "sigma": 2.0}},
    "satisfaction_scores": {
        "class_name": "Triangular",
        "params": {"low": 1.0, "high": 10.0, "mode": 8.0},
    },
}

# Create all distributions with a master seed
distributions = DistributionRegistry.create_batch(config_dict, main_seed=12345)

# Access distributions by name
arrivals = distributions["customer_arrivals"]
service = distributions["service_times"]
satisfaction = distributions["satisfaction_scores"]

print("Created distributions:")
for name, dist in distributions.items():
    print(f"- {name}: {dist}")

# Generate samples from all distributions
for name, dist in distributions.items():
    print(f"{name} samples: {dist.sample()}")
```

```{python}
# List-based configuration (unnamed distributions)
config_list = [
    {"class_name": "Exponential", "params": {"mean": 2.0}},
    {"class_name": "Uniform", "params": {"low": 0.0, "high": 1.0}},
]

# Create distributions from the list
dist_list = DistributionRegistry.create_batch(config_list, main_seed=54321)

# Access by index
print("\nList-based distributions:")
for i, dist in enumerate(dist_list):
    print(f"Distribution {i}: {dist}")
    print(f"Sample: {dist.sample()}")
```

## 3. Working with JSON Configurations

JSON is a natural format for storing distribution configurations for your model.  First we will create an example JSON file.

```{python}
# Example JSON configuration
json_config = """
{
    "simulation_parameters": {
        "customer_arrivals": {
            "class_name": "Exponential",
            "params": {"mean": 5.0}
        },
        "checkout_times": {
            "class_name": "Lognormal",
            "params": {"mean": 3.5, "stdev": 0.8}
        },
        "browse_duration": {
            "class_name": "Triangular",
            "params": {"low": 2.0, "high": 45.0, "mode": 15.0}
        },
        "purchase_amount": {
            "class_name": "Normal",
            "params": {"mean": 75.0, "sigma": 2.0}
        }
    }
}
"""

# Parse the JSON
config_data = json.loads(json_config)

# write JSON to file
with open("example_sim_config.json", "w", encoding="utf-8") as f:
    json.dump(json_config, f, indent=4)
```

Let's assume we want to run out model with the distributions specified in the JSON file.  We need to

1. Load and parse the JSON file
2. Extract the simulation parameters
3. Batch create the distributions

```{python}
# read in file
with open("example_sim_config.json", "r", encoding="utf-8") as f:
    config = json.load(f)

# Extract the distributions configuration
distributions_config = config_data["simulation_parameters"]

# Create all distributions with reproducible seed
distributions = DistributionRegistry.create_batch(distributions_config, main_seed=42)

# Now we can use these distributions in a simulation
print("JSON-configured distributions:")
for name, dist in distributions.items():
    print(f"- {name}: {dist}")
    print(f"  Sample: {dist.sample()}")
```

### Using the built in JSON template

`DistributionRegistry` also contains a built in template to enable quick setup of the distributions you want to use in your model.  Access it via the `get_template()` function. Set `format` to `"json"`.  

> Note: The template contains all distributions in the registry with example parameters. An example of each type is included once. You will need to modify it before passing to `batch_create`.

```{python}
# Generate a JSON template
template_json = DistributionRegistry.get_template(format="json")
print("\nJSON Template (first 200 characters):")
print(template_json[:200] + "...")
```

```{.python}
# You can save the template to a file for reference
with open("distribution_template.json", "w", encoding="utf-8") as f:
    f.write(template_json)
```

## 4. Define and add your own distribution to the registry

If `sim_tools` does not contain the distribution you need you can still make use of the `DistributionRegistry` by registering your own custom class.  Instances of the class can then be created in the same way as standard `sim_tools` distributions.

```{python}
# Define and register common probability distributions
@DistributionRegistry.register()
class CustomDistribution:
    """My own custom distribution"""

    def __init__(self, param1, param2, random_seed=None):
        """Initialise"""
        self.param1 = param1
        self.param2 = param2
        self.rng = np.random.default_rng(random_seed)

    def __repr__(self):
        """Print parameters."""
        return f"CustomDistribution({self.param1}, {self.param2})"

    def sample(self):
        """Sample - replace with any sampling mechanism."""
        return 1.0
```

```{python}
# Dictionary-based configuration (named distributions)
config_dict = {
    "customer_arrivals": {"class_name": "Exponential", "params": {"mean": 4.5}},
    "service_times": {
        "class_name": "CustomDistribution",
        "params": {"param1": 10.0, "param2": 2.0},
    },
    "treatment_times": {
        "class_name": "CustomDistribution",
        "params": {"param1": 25.0, "param2": 3.0},
    },
}

# Create all distributions with a master seed
distributions = DistributionRegistry.create_batch(config_dict, main_seed=12345)

print("Created distributions:")
for name, dist in distributions.items():
    print(f"- {name}: {dist}")

# Generate samples from all distributions
for name, dist in distributions.items():
    print(f"{name} samples: {dist.sample()}")
```

## 5. `sort` argument in `DistributionRegistry.create_batch()`

The `sort` argument controls whether named distributions (when a config dictionary is used) are ordered by key before assigning random seeds.

If `sort=True` (default), distributions always get the same seeds as long as their names/definitions are unchanged, regardless of insertion order - improving reproducibility when config order is unstable (e.g., from JSON files).

If `sort=False`, seeds are based directly on the config's item order—this can result in different streams if dictionary ordering changes.

**Example:**

```{python}
config = {
    "B": {"class_name": "Exponential", "params": {"mean": 2.0}},
    "A": {"class_name": "Normal", "params": {"mean": 10.0, "sigma": 1.0}},
}

# Example with sort=True (default)
dists_sorted = DistributionRegistry.create_batch(config, main_seed=42, sort=True)
print("Sort=True: ", dists_sorted)

# Example with sort=False (order depends on dict)
dists_unsorted = DistributionRegistry.create_batch(config, main_seed=42, sort=False)
print("Sort=False: ", dists_unsorted)
```

**Why use `sort=True`?**

* **Ensures reproducibility**: named distributions always get the same random seeds, regardless of how the configuration dictionary is ordered.
* **Avoids subtle, hard-to-diagnose bugs**: Without sorting, changing the order of keys (e.g., by editing a JSON file) can silently change the random seeds, and thus your simulation results, unless you use `sort=True`.


### Empirial Distributions

In many applications of simulation, theoretical distributons may not be a good fit to the data; for example, if the data are bimodal. A number of empirical distribution options are provided by `sim-tools` to support modelling in these situations

* `RawContinuousEmpirical` 
* `RawDiscreteEmpirical`
* `GroupedContinuousEmpirical`
* `DiscreteEmpirical`


```{python}
# pylint: disable=missing-module-docstring,invalid-name
import math
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from sim_tools.distributions import (
    RawContinuousEmpirical,
    GroupedContinuousEmpirical,
    RawDiscreteEmpirical,
    DistributionRegistry,
)
```

## 1. Sampling from the raw data with interpolation

If a user has access to the raw sample data they can pass this to `RawContinuousEmpirical`.  The class implements the method described in Law and Kelton's *Simulation Modeling and Analysis*. The approach creates a piecewise linear model where interpolation is used for gaps between data samples.  The maximum and minimum values of the sample are the bounds of the distribution.

### 1.1 A simple example

The example below illustrates the difference between the standard Empirical Cumulative Distribution Function (a stepped function) and Law and Kelton's method of linear interpolation between points that is used in the sampling.

```{python}
# Generate a simple dataset with 10 values representing length of stay in days
# Values chosen to be spread out to clearly show linear lines between points
simple_los = np.array([1.0, 3.5, 5.0, 7.2, 10.0, 12.5, 15.0, 18.0, 22.0, 30.0])

simple_dist = RawContinuousEmpirical(simple_los)

# plot the standard ECDF
fig = simple_dist.plotly_ecdf_standard(showlegend=False)

# Display the plot
fig.show()
```

```{python}
# Create the empirical distribution
simple_dist = RawContinuousEmpirical(data=simple_los, random_seed=42)

# plot the linear interpolation
fig = simple_dist.plotly_ecdf_linear_interpolation()

# Display the plot
fig.show()
```

### 1.2 More complex data

We will generate four datasets to illustrate the distributon in use.  One is approx uniform , one right skewed (exponential), one bimodal (a mix of two normal distributions) and finally one with outliers.

```{python}
# Create a dictionary of distribution configurations for batch creation
dist_configs = {
    # 1. Basic uniform distribution for Hospital A
    "uniform_dist": {"class_name": "Uniform", "params": {"low": 0.5, "high": 45.5}},
    # 2. Exponential distribution for Hospital B
    "exponential_dist": {"class_name": "Exponential", "params": {"mean": 5}},
    # 3. Normal distributions for bimodal Hospital C
    "normal_short": {"class_name": "Normal", "params": {"mean": 3, "sigma": 1}},
    "normal_long": {"class_name": "Normal", "params": {"mean": 20, "sigma": 6}},
    # 4. Distributions for the realistic Hospital D
    "lognormal_dist": {
        "class_name": "Lognormal",
        "params": {"mean": 5.4, "stdev": 3.5},
    },
    "uniform_outliers": {"class_name": "Uniform", "params": {"low": 30, "high": 45}},
}

# Create distributions using the DistributionRegistry create_batch method
# Setting only the main_seed parameter - the class handles individual seeds
# automatically
distributions = DistributionRegistry.create_batch(dist_configs, main_seed=42)

# Sample from distributions to create our hospital length of stay data
# 1. Hospital A - uniform distribution
los_uniform = distributions["uniform_dist"].sample(100)

# 2. Hospital B - exponential distribution with offset and clipping
los_skewed = distributions["exponential_dist"].sample(100) + 0.5
los_skewed = np.clip(los_skewed, 0.5, 45.5)

# 3. Hospital C - bimodal distribution (combining two normal distributions)
los_bimodal = np.concatenate(
    [
        # Short stays centered around 3 days
        distributions["normal_short"].sample(70),
        # Longer stays centered around 20 days
        distributions["normal_long"].sample(30),
    ]
)
los_bimodal = np.clip(los_bimodal, 0.5, 45.5)  # Clip to our desired range

# 4. Hospital D - realistic distribution with outliers
los_realistic = np.concatenate(
    [
        # Main distribution
        distributions["lognormal_dist"].sample(95),
        # Few outliers with very long stays
        distributions["uniform_outliers"].sample(5),
    ]
)
los_realistic = np.clip(los_realistic, 0.5, 45.5)  # Clip to our desired range

# Create a DataFrame to organize our data
hospital_data = pd.DataFrame(
    {
        "Hospital A": los_uniform,
        "Hospital B": los_skewed,
        "Hospital C": los_bimodal,
        "Hospital D": los_realistic,
    }
)

# A summary of the synthetic dataset
hospital_data.describe()
```

The function `create_combined_plotly_figure` is included as a helper to merge four plots of the hospital datasets together.  

```{python}
# pylint: disable=too-many-arguments,too-many-positional-arguments,too-many-locals
def create_combined_plotly_figure(
    distributions_dict,
    plot_method_name="plotly_ecdf_linear_interpolation",
    cols=2,
    plot_args=None,  # Dictionary to pass arguments to the plot method
    subplot_title_prefix="",
    main_title="Combined Distribution Plots",
    share_xaxes=False,
    share_yaxes=False,
    layout_options=None,  # Dictionary for fig.update_layout
) -> go.Figure:
    """
    Combines multiple Plotly figures from a dictionary of distributions
    into a single figure with subplots.

    Parameters
    ----------
    distributions_dict : dict
        A dictionary where keys are names (e.g., 'Hospital A') and values
        are instances of RawContinuousEmpirical (or similar).
    plot_method_name : str, default="plot_ecdf_linear_interpolation_plotly"
        The name of the method on the distribution object that returns
        a Plotly figure (e.g., 'plotly_ecdf_standard_plot').
    cols : int, default=2
        Number of columns in the subplot grid. Rows are calculated
        automatically.
    plot_args : dict, optional
        Keyword arguments to pass to the individual plotting method specified
        by plot_method_name.
    subplot_title_prefix : str, default=""
        A prefix to add before the distribution name in subplot titles.
    main_title : str, default="Combined Distribution Plots"
        The main title for the combined figure.
    share_xaxes : bool, default=False
        Whether to share x-axes across subplots [2].
    share_yaxes : bool, default=False
        Whether to share y-axes across subplots [2].
    layout_options : dict, optional
        Additional layout options to apply to the final figure using
        fig.update_layout().

    Returns
    -------
    plotly.graph_objects.Figure
        The combined Plotly figure with subplots.
    """
    if not distributions_dict:
        return go.Figure().update_layout(title_text="No distributions provided")

    if plot_args is None:
        plot_args = {}

    dist_names = list(distributions_dict.keys())
    num_plots = len(dist_names)
    rows = math.ceil(num_plots / cols)

    subplot_titles = [f"{subplot_title_prefix}{name}" for name in dist_names]

    # Create the subplot figure structure [2, 3]
    fig = make_subplots(
        rows=rows,
        cols=cols,
        subplot_titles=subplot_titles,
        shared_xaxes=share_xaxes,
        shared_yaxes=share_yaxes,
    )

    # Iterate through distributions, generate plots, and add traces [1, 4]
    for i, name in enumerate(dist_names):
        distribution = distributions_dict[name]
        plot_method = getattr(distribution, plot_method_name)

        # Generate the individual figure for this distribution
        # Pass any specific plot arguments
        individual_fig = plot_method(**plot_args)

        # Calculate subplot position (1-based index)
        current_row = i // cols + 1
        current_col = i % cols + 1

        # Add each trace from the individual figure to the main subplot
        # figure [1, 5]
        for trace in individual_fig.data:
            # Preserve legend grouping if the individual plot used it
            trace.legendgroup = name
            # Optionally hide duplicate legend entries if desired,
            # though Plotly often handles this with legendgroup
            if i > 0:
                # Keep legend status consistent
                trace.showlegend = plot_args.get("showlegend", True)

            fig.add_trace(trace, row=current_row, col=current_col)

    # Update overall layout
    fig.update_layout(title_text=main_title)
    # Hide duplicate legends if traces weren't automatically grouped well
    # fig.update_layout(showlegend=True) # Or False if too cluttered

    if layout_options:
        fig.update_layout(layout_options)

    return fig
```

```{python}
# create the datasets
distributions = {}
for hospital in hospital_data.columns:
    distributions[hospital] = RawContinuousEmpirical(
        data=hospital_data[hospital], random_seed=42
    )

# plot the linear interpolation function
plot_args = {"showlegend": False}
combined_fig = create_combined_plotly_figure(
    distributions,
    plot_method_name="plotly_ecdf_linear_interpolation",
    plot_args=plot_args,
    cols=2,
    main_title="Linear Interpolation ECDFs",
)
combined_fig.show()
```

Lastly, we can check the sampling is working as expected using the function `plotly_original_versus_sampled`. This will compare the original raw data with the samples in a histogram.  Change the hospital name to see a different chart.

```{python}
def plotly_original_versus_sampled(sample_dict, original_data):
    """
    Creates a Plotly chart with a dropdown to switch between datasets in
    sample_dict.

    Parameters
    ----------
    sample_dict : dict
        Dictionary where keys are dataset names and values are the data.
    original_data : dict
        Dictionary containing the original data. Same keys as sample_dict.

    Returns
    -------
    plotly.graph_objects.Figure
        A Plotly figure with the original data and a dropdown menu.
    """

    # Create a figure with subplots
    fig = make_subplots(
        rows=1, cols=2, subplot_titles=("Original Data", "Sampled Data")
    )

    first_key = next(iter(sample_dict))

    # Add original data histogram
    fig.add_trace(
        go.Histogram(
            x=original_data[first_key],
            nbinsx=20,
            name="Original",
            histnorm="probability density",
        ),
        row=1,
        col=1,
    )

    # Add initial sampled data histogram (first key in the dictionary)
    fig.add_trace(
        go.Histogram(
            x=sample_dict[first_key],
            nbinsx=20,
            name="Sampled",
            histnorm="probability density",
        ),
        row=1,
        col=2,
    )

    # Create dropdown buttons
    dropdown_buttons = []
    for key in sample_dict:
        n = len(sample_dict[key])
        button = {
            "method": "update",
            "label": key,
            "args": [
                {"x": [original_data[key], sample_dict[key]]},
                {"title": f"{key}: Original vs Sampled Data ({key}; n = {n})"},
            ],
        }
        dropdown_buttons.append(button)

    # Update layout with dropdown menu
    fig.update_layout(
        updatemenus=[
            {
                "active": 0,
                "buttons": dropdown_buttons,
                "x": 1.0,  # Center horizontally
                "xanchor": "right",  # Anchor point is the center
                "y": 1.1,  # Position vertically (adjust as needed)
                "yanchor": "top",
            }
        ],  # Anchor point is the top
        title_text=f"{first_key}: Original vs Sampled Data ({first_key})",
        xaxis_title_text="Length of Stay (Days)",
        yaxis_title_text="Relative Freq",
        height=500,
        width=900,
    )

    return fig
```

```{python}
# Generate samples from each distribution
samples = {}
for hospital in hospital_data.columns:
    samples[hospital] = distributions[hospital].sample(size=1000)

fig = plotly_original_versus_sampled(samples, hospital_data)

fig.show()
```

## 2. Sampling from the Raw data directly

If the raw data is available then an alternative to Law and Kelton's interpolation method israw data directly using `RawDiscreteEmpirical`.

```{python}
distributions = {}
samples = {}
for hospital in hospital_data.columns:
    distributions[hospital] = RawDiscreteEmpirical(hospital_data[hospital])
    samples[hospital] = distributions[hospital].sample(size=1000)

fig = plotly_original_versus_sampled(samples, hospital_data)

fig.show()
```

## 3. Sampling from grouped empirical data

In some circumstances a modeller may need to sample from simpler grouped data.  For example, if the raw data are not available (e.g. if taken from an academic research paper) or if the dataset is large and causes memory issues or a bottleneck in model run time. 

`GroupedContinuousEmpirical` implements the interpolation method described in Law and Kelton's *Simulation Modeling and Analysis*.  

We need to provide the data as k adjacent intervals 

$$[a_0, a1), [a_1, a_2), \dots, [a_{k-1}, a_k)$$

In practice we provide `GroupedContinuousEmpirical` with two arrays: `lower_bounds` and `upper_bounds`.  Note that the intervals do no need to be equally spaced.

A third array `freq` contains the number of observations in each interval.

```{python}
grp_dist = GroupedContinuousEmpirical(
    lower_bounds=[0, 5, 10, 15, 30, 45, 60, 120, 180, 240, 480],
    upper_bounds=[5, 10, 15, 30, 45, 60, 120, 180, 240, 480, 2880],
    freq=[34, 4, 8, 13, 15, 13, 19, 13, 9, 12, 73],
    random_seed=42,
)

# sample will interpolate across and within each group.
grp_dist.sample()
```


### Time dependent arrivals via thinning

This notebook provides an overview of how to use the `time_dependent.NSPPThinning` class.  

Thinning is an acceptance-rejection approach to sampling inter-arrival times (IAT) from a time dependent distribution where each time period follows its own exponential distribution.

There are two random variables employed in sampling: an exponential distribution (used to sample IAT) and a uniform distibution (used to accept/reject samples).

All IATs are sampled from an Exponential distribution with the highest arrival rate (most frequent). These arrivals are then rejected (thinned) proportional to the ratio of the current arrival rate to the maximum arrival rate.  The algorithm executes until a sample is accepted.  The IAT returned is the sum of all the IATs that were sampled.

```{note}
As of v0.9.0 you only need to provide an arrival data set containing two columns

* `t`: the time period e.g. 0, 60, 120 minutes or 0, 1, 2 hours
* `mean_iat`: the **mean interval arrival time (IAT)** for the time period.
```

## The thinning algorithm

A NSPP has arrival rate $\lambda(t)$ where $0 \leq t \leq T$

Here $i$ is the arrival number and $\mathcal{T_i}$ is its arrival time.

1. Let $\lambda^* = \max_{0 \leq t \leq T}\lambda(t)$ be the maximum of the arrival rate function and set $t = 0$ and $i=1$

2. Generate $e$ from the exponential distribution with rate $\lambda^*$ and let $t = t + e$ (this is the time of the next entity will arrive)

3. Generate $u$ from the $U(0,1)$ distribution.  If $u \leq \dfrac{\lambda(t)}{\lambda^*}$ then $\mathcal{T_i} =t$ and $i = i + 1$

4. Go to Step 2.


## `sim-tools` imports

```{python}
# pylint: disable=missing-module-docstring,invalid-name
import numpy as np
import matplotlib.pyplot as plt

from sim_tools.datasets import load_banks_et_al_nspp
from sim_tools.time_dependent import NSPPThinning, nspp_plot, nspp_simulation
```

```{python}
# ggplot style for plots
plt.style.use("ggplot")
```

## Example from Banks et al.

We will illustrate the use of `NSPPThinning` using an example from Banks et al. 

The table below breaks an arrival process down into 60 minutes intervals.


| t(min) | Mean time between arrivals (min) | Arrival Rate $\lambda(t)$ (arrivals/min) |
|:------:|:--------------------------------:|:--------------------------------------:|
|    0   |                15                |                  1/15                  |
|   60   |                12                |                  1/12                  |
|   120  |                 7                |                   1/7                  |
|   180  |                 5                |                   1/5                  |
|   240  |                 8                |                   1/8                  |
|   300  |                10                |                  1/10                  |
|   360  |                15                |                  1/15                  |
|   420  |                20                |                  1/20                  |
|   480  |                20                |                  1/20                  |


> **Interpretation**: In the table above the fastest arrival rate is 1/5 customers per minute or 5 minutes between customer arrivals.

```{python}
data = load_banks_et_al_nspp()
print(data)
```

The `NSPPThinning` class does not require the `arrival_rate` column. Internally it assume that intervals are equal and uses the 1st and 2nd rows of the `t` column to calculate width (override with the `interval_width` parameter).  As the IAT is the inverse of the rate it thins the arrivals using 

$$ u \leq \dfrac{IAT_{min}}{IAT(t)}$$

```{python}
# create arrivals and set random number seeds
seeds = np.random.SeedSequence(42).spawn(2)
arrivals = NSPPThinning(data=data, random_seed1=seeds[0], random_seed2=seeds[1])

print(arrivals)
```

```{python}
# number of arrivals to simulate
n_arrivals = 15

# run simulation
simulation_time = 0.0
for _ in range(n_arrivals):
    iat = arrivals.sample(simulation_time)
    simulation_time += iat
    print(f"{simulation_time:.2f} Patient arrival (IAT={iat:.2f})")
```

Alternatively you could set the interval width manually

```{python}
# create arrivals and set random number seeds
seeds = np.random.SeedSequence(42).spawn(2)
arrivals = NSPPThinning(
    data=data, interval_width=60, random_seed1=seeds[0], random_seed2=seeds[1]
)
print(arrivals)
```

## Check NSPP is working as expected

As part of model testing you can check that the NSPP for arrivals has been setup correctly by using `nspp_plot`. This runs multiple replications of the NSPP and returns a plot of the number of arrivals per interval.  The profile can be used as a sense check.  If you prefer to have the data in tabular form you can use `nspp_simulation` that returns a `pd.DataFrame` containing multiple replications of the number of arrivals per interval.

### NSPP plot

```{python}
fig, ax = nspp_plot(data, run_length=540)
```

### Tabular NSPP data

The function `nspp_simulation` returns a `pd.DataFrame`. By default 1000 replications of the NSPP are run.  The run length can be specified, but if omitted it defaults to the last interval + interval width in the arrival profile.

In the Banks et al. example we are given the rate per minute of arrivals. We therefore multiply this by 60 get the expected number of arrivals in an hour.  We compare this to the actual simulated.

```{python}
nspp_replications = nspp_simulation(data, n_reps=1000)

# peak at data
nspp_replications.head()
```

```{python}
# re-label columns
nspp_replications.columns = data["t"].values


## banks data converted from arrivasl per min to arrivals per hour
data["expected_mean_arrivals_hr"] = data["arrival_rate"] * 60.0
data["simulated_mean_arrivals_hr"] = nspp_replications.mean(axis=0).values
data.round(1)
```


### A simple trace

For debugging code based models such as those using `simpy` it is sometimes useful to do informal debugging by printing events as the simulation executes.  Python's built in `print` function can be used. Alternatively `sim-tools.trace` can be used.

For small models the `trace` module offers the function `trace()`. It provides easy to use functionality with defaults.

This notebooks illustrates simple tracing of simulation using some small models.

## 1. Imports


```{python}
# pylint: disable=missing-module-docstring,invalid-name

import itertools
from typing import Optional

import simpy
from sim_tools.trace import trace
from sim_tools.distributions import Exponential
```

## 2. A trace using default colouring

In this first example we simulate patient arrivals to a treatment facility.  Arrivals are at random with a mean inter-arrival time of 5 minutes.
Rather than using python's print command we instead make use of `trace`. We pass the following arguments:

1. `time`: the current simulation time
2. `debug`: if we toggle to `False` this hides the trace or `True` to show (default = False)
3. `msg`: the string message to display. This can include emoji 
4. `process_id`: an optional string to identify the process. Ideally this should be unique to aid debugging.

```{python}
def patient_generator(
    env: simpy.Environment, dist: Exponential, debug: Optional[bool] = False
):
    """Generate patient arrivals to the treatment clinic"""
    for patient_count in itertools.count(1):
        # sample inter-arrival time
        iat = dist.sample()
        yield env.timeout(iat)
        trace(time=env.now, debug=debug, msg="new arrival 🤒", identifier=patient_count)
```

```{python}
# script to run model
DEBUG = True
SEED = 42
arrival_dist = Exponential(5.0, random_seed=SEED)
env = simpy.Environment()
env.process(patient_generator(env, arrival_dist, DEBUG))
env.run(50.0)
```

## 3. Configure colouring of output


In `trace` the `config` parameter is a user settable dictionary object. It can be used to change the colour of text in the trace.  We will first show a simple demonstration of setting options. and then use to illustrate how it is useful in practice.

The default`config` is

```{.python}
config = {
    "class":None, 
    "class_colour":"bold blue", 
    "time_colour":'bold blue', 
    "time_dp":2,
    "message_colour":'black',
    "tracked":None
}
```

* `class`: a string representing the class or type of trace event occuring.  This could be a process type for example, "patient", "stroke patient" or "arrival" or "treatment".
* `class_colour`: choose a colour to display the class name e.g. "green" or "bold green"
* `time_colour`: choose a colour to display the time
* `time_dp`: choose the number of decimal places for time (default=2)
* `message_color`: colour of the message text
* `tracked`: a list containing identifiers (e.g. `[1, 2, 25]`) that limits what is tracked.  Works with `identiifier` parameters of `trace`

> Note: you do not need to set all of the parameters.  Just set what you need and the defaults will be used for other parameters.

```{python}
def get_config():
    """Returns a custom trace configuration"""
    config = {
        "class": "Arrival",
        "class_colour": "green",
        "time_colour": "bold black",
        "message_colour": "red",
    }
    return config


# pylint: disable=function-redefined
def patient_generator(
    env: simpy.Environment, dist: Exponential, debug: Optional[bool] = False
):
    """Generate patient arrivals to the treatment clinic"""
    for patient_count in itertools.count(1):
        # sample inter-arrival time
        iat = dist.sample()
        yield env.timeout(iat)
        trace(
            time=env.now,
            debug=debug,
            msg="new patient 🤒",
            identifier=patient_count,
            config=get_config(),
        )
```

```{python}
# script to run model
DEBUG = True
SEED = 42
arrival_dist = Exponential(5.0, random_seed=SEED)
env = simpy.Environment()
env.process(patient_generator(env, arrival_dist, DEBUG))
env.run(50.0)
```

## 4. Configuring multiple processes/ activities

We modify the `simpy` example to include a `treatment` process.  We wish for treatment events to be easily seen in the trace output.  We therefore modify `get_config()` as follows:

```{python}
def get_config():
    """Returns a custom trace configuration for arrivals"""
    config = {
        "arrivals": {
            "class": "Patient",
            "class_colour": "green",
            "time_colour": "bold black",
            "message_colour": "green",
        },
        "treatment": {
            "class": "Patient",
            "class_colour": "magenta",
            "time_colour": "bold black",
            "message_colour": "magenta",
        },
    }

    return config
```

The function now returns a nested dictionary containing trace configurations for both arrivals and treatment processes.

The modified `simpy` model is below.  We have included an `Experiment` class to hold our simulation input parameters. Note this has also been used to store the debug toggle.  This reduces the number of arguments each function requires.

When we run the new model is it simple to distinguish between the trace for arrivals and the treatment.  

> Note: this could be further fine tuned by manipulating which processes have `debug=True`

```{python}
# pylint: disable=too-few-public-methods
class Experiment:
    """Parameter container"""

    def __init__(
        self, mean_iat: float, mean_service: float, debug: Optional[bool] = False
    ):
        self.arrival_dist = Exponential(mean_iat, random_seed=42)
        self.treat_dist = Exponential(mean_service, random_seed=101)
        self.debug = debug


def treatment(env: simpy.Environment, exp: Experiment, patient_id: int):
    """Simulate a uncapacitated treatment process"""
    trace(
        time=env.now,
        debug=exp.debug,
        msg="enter treatment 👩🏻‍⚕️",
        identifier=patient_id,
        config=get_config()["treatment"],
    )

    # delay
    service_time = exp.treat_dist.sample()
    yield env.timeout(service_time)

    trace(
        time=env.now,
        debug=exp.debug,
        msg="treatment complete ⛔",
        identifier=patient_id,
        config=get_config()["treatment"],
    )


# pylint: disable=function-redefined
def patient_generator(env: simpy.Environment, exp: Experiment):
    """Generate patient arrivals to the treatment clinic"""
    for patient_count in itertools.count(1):
        # sample inter-arrival time
        iat = exp.arrival_dist.sample()
        yield env.timeout(iat)
        trace(
            time=env.now,
            debug=exp.debug,
            msg="new arrival 🤒",
            identifier=patient_count,
            config=get_config()["arrivals"],
        )
        env.process(treatment(env, exp, patient_count))
```

```{python}
# script to run model
DEBUG = True
experiment = Experiment(5.0, 10.0, DEBUG)
env = simpy.Environment()
env.process(patient_generator(env, experiment))
env.run(40.0)
```

## Tracking

Assume we wish to track patients 4 and 5.  The `config['tracking']` option accepts a list of identifiers and will limit trace output. For simple models it is best to add this as a parameter in the Experiment class.

The script below declares the variable `TRACKED=[4, 5]`

> Note that when there are a lot of processes to track it may be become simplier to use the object orientated interface to trace functionality.

```{python}
# pylint: disable=too-few-public-methods
class Experiment:
    """Parameter container"""

    def __init__(
        self,
        mean_iat: float,
        mean_service: float,
        debug: Optional[bool] = False,
        tracked: Optional[list] = None,
    ):
        self.arrival_dist = Exponential(mean_iat, random_seed=42)
        self.treat_dist = Exponential(mean_service, random_seed=101)
        self.debug = debug
        self.tracked = tracked


def get_config(exp: Experiment):
    """Returns a custom trace configuration for arrivals
    and treatment

    Modified to add tracking information to each config"""
    config = {
        "arrivals": {
            "class": "Patient",
            "class_colour": "green",
            "time_colour": "bold black",
            "message_colour": "green",
            "tracked": exp.tracked,
        },
        "treatment": {
            "class": "Patient",
            "class_colour": "magenta",
            "time_colour": "bold black",
            "message_colour": "magenta",
            "tracked": exp.tracked,
        },
    }

    return config


def treatment(env: simpy.Environment, exp: Experiment, patient_id: int):
    """Simulate a uncapacitated treatment process"""
    trace(
        time=env.now,
        debug=exp.debug,
        msg="enter treatment 👩🏻‍⚕️",
        identifier=patient_id,
        config=get_config(exp)["treatment"],
    )

    # delay
    service_time = exp.treat_dist.sample()
    yield env.timeout(service_time)

    trace(
        time=env.now,
        debug=exp.debug,
        msg="treatment complete ⛔",
        identifier=patient_id,
        config=get_config(exp)["treatment"],
    )


def patient_generator(env: simpy.Environment, exp: Experiment):
    """Generate patient arrivals to the treatment clinic"""
    for patient_count in itertools.count(1):
        # sample inter-arrival time
        iat = exp.arrival_dist.sample()
        yield env.timeout(iat)
        trace(
            time=env.now,
            debug=exp.debug,
            msg="new arrival 🤒",
            identifier=patient_count,
            config=get_config(exp)["arrivals"],
        )
        env.process(treatment(env, exp, patient_count))
```

```{python}
# script to run model
DEBUG = True
TRACKED = [4, 5]
experiment = Experiment(5.0, 10.0, DEBUG, TRACKED)
env = simpy.Environment()
env.process(patient_generator(env, experiment))
env.run(50.0)
```


## 1. Imports

The model used to demonstrate trace functionality is implemented in `simpy`

```{python}
# pylint: disable=missing-module-docstring,invalid-name,too-many-lines

import itertools

import numpy as np
import pandas as pd
import simpy

import sim_tools
from sim_tools.distributions import Exponential, Normal, Lognormal, Bernoulli
from sim_tools.time_dependent import NSPPThinning
from sim_tools.trace import Traceable
```

```{python}
print(f"simpy version: {simpy.__version__}")
print(f"sim-tools version: {sim_tools.__version__}")
```

## 2. Constants and defaults for modelling **as-is**

### 2.1 Distribution parameters

```{python}
# sign-in/triage parameters
DEFAULT_TRIAGE_MEAN = 3.0

# registration parameters
DEFAULT_REG_MEAN = 5.0
DEFAULT_REG_VAR = 2.0

# examination parameters
DEFAULT_EXAM_MEAN = 16.0
DEFAULT_EXAM_VAR = 3.0
DEFAULT_EXAM_MIN = 0.5

# trauma/stabilisation
DEFAULT_TRAUMA_MEAN = 90.0

# Trauma treatment
DEFAULT_TRAUMA_TREAT_MEAN = 30.0
DEFAULT_TRAUMA_TREAT_VAR = 4.0

# Non trauma treatment
DEFAULT_NON_TRAUMA_TREAT_MEAN = 13.3
DEFAULT_NON_TRAUMA_TREAT_VAR = 2.0

# prob patient requires treatment given trauma
DEFAULT_NON_TRAUMA_TREAT_P = 0.60

# proportion of patients triaged as trauma
DEFAULT_PROB_TRAUMA = 0.12
```

### 2.2 Time dependent arrival rates data

The data for arrival rates varies between clinic opening at 6am and closure at 12am.

```{python}
# data are held in the Github repo and loaded from there.
NSPP_PATH = (
    "https://raw.githubusercontent.com/TomMonks/" +
    "open-science-for-sim/main/src/notebooks/01_foss_sim/data/ed_arrivals.csv"
)
```

### 2.3 Resource counts

> Inter count variables representing the number of resources at each activity in the processes.

```{python}
DEFAULT_N_TRIAGE = 1
DEFAULT_N_REG = 1
DEFAULT_N_EXAM = 3
DEFAULT_N_TRAUMA = 2

# Non-trauma cubicles
DEFAULT_N_CUBICLES_1 = 1

# trauma pathway cubicles
DEFAULT_N_CUBICLES_2 = 1
```

### 2.4 Simulation model run settings

```{python}
# default random number SET
DEFAULT_RNG_SET = None
N_STREAMS = 20

# default results collection period
DEFAULT_RESULTS_COLLECTION_PERIOD = 60 * 19

# number of replications.
DEFAULT_N_REPS = 5

# set to True to show a trace of the simulation model (best used in single run
# mode)
DEBUG = False

# provide a list of integers that represent patient IDs.  Only these patients
# will be tracked
TRACKED = None
```

## 3. Model parameterisation

For convienience a container class is used to hold the large number of model parameters.  The `Scenario` class includes defaults  these can easily be changed and at runtime to experiments with different designs.

```{python}
def load_arrival_profile(profile_path):
    """
    Load and transform an arrival profile from a CSV file.

    Parameters:
    -----------
    profile_path : str
        Path to CSV file containing the arrival rate profile. It is expected
        to have a column called 'arrival_rate'.

    Returns:
    --------
    pd.DataFrame
        A DataFrame with the following columns:
        - 't': Time.
        - 'arrival_rate': Original arrival rate from CSV.
        - 'mean_iat': Mean inter-arrival time.
    """
    df = (
        pd.read_csv(profile_path)
        .reset_index()
        .rename(columns={"index": "t"})
        .assign(mean_iat=lambda x: 60.0 / x.arrival_rate, t=lambda x: x.t * 60.0)
    )
    return df
```

```{python}
load_arrival_profile(NSPP_PATH)
```

```{python}
# pylint: disable=too-many-instance-attributes,too-many-arguments
# pylint: disable=too-many-positional-arguments,too-many-locals
class Scenario:
    """
    Container class for scenario parameters/arguments

    Passed to a model and its process classes
    """

    def __init__(
        self,
        random_number_set=DEFAULT_RNG_SET,
        n_triage=DEFAULT_N_TRIAGE,
        n_reg=DEFAULT_N_REG,
        n_exam=DEFAULT_N_EXAM,
        n_trauma=DEFAULT_N_TRAUMA,
        n_cubicles_1=DEFAULT_N_CUBICLES_1,
        n_cubicles_2=DEFAULT_N_CUBICLES_2,
        triage_mean=DEFAULT_TRIAGE_MEAN,
        reg_mean=DEFAULT_REG_MEAN,
        reg_var=DEFAULT_REG_VAR,
        exam_mean=DEFAULT_EXAM_MEAN,
        exam_var=DEFAULT_EXAM_VAR,
        exam_min=DEFAULT_EXAM_MIN,
        trauma_mean=DEFAULT_TRAUMA_MEAN,
        trauma_treat_mean=DEFAULT_TRAUMA_TREAT_MEAN,
        trauma_treat_var=DEFAULT_TRAUMA_TREAT_VAR,
        non_trauma_treat_mean=DEFAULT_NON_TRAUMA_TREAT_MEAN,
        non_trauma_treat_var=DEFAULT_NON_TRAUMA_TREAT_VAR,
        non_trauma_treat_p=DEFAULT_NON_TRAUMA_TREAT_P,
        prob_trauma=DEFAULT_PROB_TRAUMA,
        debug=DEBUG,
        tracked=TRACKED,
    ):
        """
        Create a scenario to parameterise the simulation model

        Parameters:
        -----------
        random_number_set: int, optional (default=DEFAULT_RNG_SET)
            Set to control the initial seeds of each stream of pseudo
            random numbers used in the model.

        n_triage: int
            The number of triage cubicles

        n_reg: int
            The number of registration clerks

        n_exam: int
            The number of examination rooms

        n_trauma: int
            The number of trauma bays for stabilisation

        n_cubicles_1: int
            The number of non-trauma treatment cubicles

        n_cubicles_2: int
            The number of trauma treatment cubicles

        triage_mean: float
            Mean duration of the triage distribution (Exponential)

        reg_mean: float
            Mean duration of the registration distribution (Lognormal)

        reg_var: float
            Variance of the registration distribution (Lognormal)

        exam_mean: float
            Mean of the examination distribution (Normal)

        exam_var: float
            Variance of the examination distribution (Normal)

        exam_min: float
            The minimum value that an examination can take (Truncated Normal)

        trauma_mean: float
            Mean of the trauma stabilisation distribution (Exponential)

        trauma_treat_mean: float
            Mean of the trauma cubicle treatment distribution (Lognormal)

        trauma_treat_var: float
            Variance of the trauma cubicle treatment distribution (Lognormal)

        non_trauma_treat_mean: float
            Mean of the non trauma treatment distribution

        non_trauma_treat_var: float
            Variance of the non trauma treatment distribution

        non_trauma_treat_p: float
            Probability non trauma patient requires treatment

        prob_trauma: float
            probability that a new arrival is a trauma patient.

        debug: int
            Set to true to display a trace of simulated events in the model
            Note this is best used in single_run mode.

        tracked: list
            List of integers that represent patient IDs.
            Used with debug. If debug is True then only display trace for
            specific patients.
        """
        # sampling
        self.random_number_set = random_number_set

        # store parameters for sampling
        self.triage_mean = triage_mean
        self.reg_mean = reg_mean
        self.reg_var = reg_var
        self.exam_mean = exam_mean
        self.exam_var = exam_var
        self.exam_min = exam_min
        self.trauma_mean = trauma_mean
        self.trauma_treat_mean = trauma_treat_mean
        self.trauma_treat_var = trauma_treat_var
        self.non_trauma_treat_mean = non_trauma_treat_mean
        self.non_trauma_treat_var = non_trauma_treat_var
        self.non_trauma_treat_p = non_trauma_treat_p
        self.prob_trauma = prob_trauma

        self.init_sampling()

        # count of each type of resource
        self.init_resourse_counts(
            n_triage, n_reg, n_exam, n_trauma, n_cubicles_1, n_cubicles_2
        )

        # debug/trace information
        self.debug = debug
        self.tracked = tracked

    def set_random_no_set(self, random_number_set):
        """
        Controls the random sampling
        Parameters:
        ----------
        random_number_set: int
            Used to control the set of pseudo random numbers
            used by the distributions in the simulation.
        """
        self.random_number_set = random_number_set
        self.init_sampling()

    # pylint: disable=too-many-arguments,too-many-positional-arguments
    def init_resourse_counts(
        self, n_triage, n_reg, n_exam, n_trauma, n_cubicles_1, n_cubicles_2
    ):
        """
        Init the counts of resources to default values...
        """
        self.n_triage = n_triage
        self.n_reg = n_reg
        self.n_exam = n_exam
        self.n_trauma = n_trauma

        # non-trauma (1), trauma (2) treatment cubicles
        self.n_cubicles_1 = n_cubicles_1
        self.n_cubicles_2 = n_cubicles_2

    def init_sampling(self):
        """
        Create the distributions used by the model and initialise
        the random seeds of each.
        """
        # MODIFICATION. Better method for producing n non-overlapping streams
        seed_sequence = np.random.SeedSequence(self.random_number_set)

        # Generate n high quality child seeds
        self.seeds = seed_sequence.spawn(N_STREAMS)

        # create distributions

        # Triage duration
        self.triage_dist = Exponential(self.triage_mean, random_seed=self.seeds[0])

        # Registration duration (non-trauma only)
        self.reg_dist = Lognormal(
            self.reg_mean, np.sqrt(self.reg_var), random_seed=self.seeds[1]
        )

        # Evaluation (non-trauma only)
        self.exam_dist = Normal(
            self.exam_mean,
            np.sqrt(self.exam_var),
            minimum=self.exam_min,
            random_seed=self.seeds[2],
        )

        # Trauma/stablisation duration (trauma only)
        self.trauma_dist = Exponential(self.trauma_mean, random_seed=self.seeds[3])

        # Non-trauma treatment
        self.nt_treat_dist = Lognormal(
            self.non_trauma_treat_mean,
            np.sqrt(self.non_trauma_treat_var),
            random_seed=self.seeds[4],
        )

        # treatment of trauma patients
        self.treat_dist = Lognormal(
            self.trauma_treat_mean,
            np.sqrt(self.trauma_treat_var),
            random_seed=self.seeds[5],
        )

        # probability of non-trauma patient requiring treatment
        self.nt_p_treat_dist = Bernoulli(
            self.non_trauma_treat_p, random_seed=self.seeds[6]
        )

        # probability of non-trauma versus trauma patient
        self.p_trauma_dist = Bernoulli(self.prob_trauma, random_seed=self.seeds[7])

        # init sampling for non-stationary poisson process
        self.init_nspp()

    def init_nspp(self):
        """
        Initialise the Nonhomogeneous Poisson Process (NSPP) for arrivals.
        """

        # Load arrival profile
        self.arrivals = load_arrival_profile(NSPP_PATH)

        # Compuate maximum arrival rate (smallest time between arrivals)
        self.lambda_max = self.arrivals["arrival_rate"].max()

        # NSPP thinning distribution
        self.thinning_dist = NSPPThinning(
            self.arrivals, random_seed1=self.seeds[8], random_seed2=self.seeds[9]
        )
```

## 6. Patient Pathways Process Logic

`simpy` uses a process based worldview.  We can easily create whatever logic - simple or complex  for the model.  Here the process logic for trauma and non-trauma patients is seperated into two classes `TraumaPathway`  and `NonTraumaPathway`. 

```{python}
class TraumaPathway(Traceable):
    """
    Encapsulates the process a patient with severe injuries or illness.

    These patients are signed into the ED and triaged as having severe injuries
    or illness.

    Patients are stabilised in resus (trauma) and then sent to Treatment.
    Following treatment they are discharged.
    """

    def __init__(self, identifier, env, args):
        """
        Constructor method

        Params:
        -----
        identifier: int
            a numeric identifier for the patient.

        env: simpy.Environment
            the simulation environment

        args: Scenario
            Container class for the simulation parameters

        """
        # initialise Trace
        super().__init__(debug=args.debug)

        self.identifier = identifier
        self.env = env
        self.args = args

        # metrics
        self.arrival = -np.inf
        self.wait_triage = -np.inf
        self.wait_trauma = -np.inf
        self.wait_treat = -np.inf
        self.total_time = -np.inf

        self.triage_duration = -np.inf
        self.trauma_duration = -np.inf
        self.treat_duration = -np.inf

    def execute(self):
        """
        simulates the major treatment process for a patient

        1. request and wait for sign-in/triage
        2. trauma
        3. treatment
        """
        # record the time of arrival and entered the triage queue
        self.arrival = self.env.now

        self.trauma_arrival()

        # request sign-in/triage
        with self.args.triage.request() as req:
            yield req
            # record the waiting time for triage
            self.wait_triage = self.env.now - self.arrival

            # triage begin event
            self.triage_begin()

            # sample triage duration.
            self.triage_duration = self.args.triage_dist.sample()
            yield self.env.timeout(self.triage_duration)

            # triage complete event
            self.triage_complete()

        # record the time that entered the trauma queue
        start_wait = self.env.now

        # request trauma room
        with self.args.trauma.request() as req:
            yield req

            # record the waiting time for trauma
            self.wait_trauma = self.env.now - start_wait

            # trauma begin event
            self.trauma_begin()

            # sample stablisation duration.
            self.trauma_duration = self.args.trauma_dist.sample()
            yield self.env.timeout(self.trauma_duration)

            # trauma complete event
            self.trauma_complete()

        # record the time that entered the treatment queue
        start_wait = self.env.now

        # request treatment cubicle
        with self.args.cubicle_2.request() as req:
            yield req

            # record the waiting time for trauma
            self.wait_treat = self.env.now - start_wait
            self.treatment_begin()

            # sample treatment duration.
            self.treat_duration = self.args.treat_dist.sample()
            yield self.env.timeout(self.treat_duration)

            self.treatment_complete()

        # total time in system
        self.total_time = self.env.now - self.arrival

    def trauma_arrival(self):
        """Trauma arrival event"""
        self.trace(self.env.now, "arrival at centre 🚑", self.identifier)

    def triage_begin(self):
        """Enter triage event"""
        self.trace(
            self.env.now,
            f"enter triage. Waiting time: {self.wait_triage:.3f}",
            self.identifier,
        )

    def trauma_begin(self):
        """Enter trauma stabilisation"""
        self.trace(
            self.env.now,
            f"enter stabilisation. Waiting time: {self.wait_trauma:.3f}",
            self.identifier,
        )

    def treatment_begin(self):
        """Enter trauma treatment post stabilisation"""
        self.trace(
            self.env.now,
            f"enter treatment. Waiting time: {self.wait_treat:.3f}",
            self.identifier,
        )

    def triage_complete(self):
        """Triage complete event"""
        self.trace(self.env.now, "triage complete", self.identifier)

    def trauma_complete(self):
        """Patient stay in trauma is complete."""
        self.trace(self.env.now, "stabilisation complete.", self.identifier)

    def treatment_complete(self):
        """Treatment complete event"""
        self.trace(
            self.env.now,
            f"patient {self.identifier} treatment complete; "
            f"waiting time was {self.wait_treat:.3f}",
            self.identifier,
        )

    def _trace_config(self):
        """Override trace config"""
        config = {
            "name": "Trauma",
            "name_colour": "bold magenta",
            "time_colour": "bold magenta",
            "tracked": self.args.tracked,
        }
        return config
```

```{python}
class NonTraumaPathway(Traceable):
    """
    Encapsulates the process a patient with minor injuries and illness.

    These patients are signed into the ED and triaged as having minor
    complaints and streamed to registration and then examination.

    Post examination 40% are discharged while 60% proceed to treatment.
    Following treatment they are discharged.
    """

    def __init__(self, identifier, env, args):
        """
        Constructor method

        Params:
        -----
        identifier: int
            a numeric identifier for the patient.

        env: simpy.Environment
            the simulation environment

        args: Scenario
            Container class for the simulation parameters

        """
        super().__init__(debug=args.debug)
        self.identifier = identifier
        self.env = env
        self.args = args

        # triage resource
        self.triage = args.triage

        # metrics
        self.arrival = -np.inf
        self.wait_triage = -np.inf
        self.wait_reg = -np.inf
        self.wait_exam = -np.inf
        self.wait_treat = -np.inf
        self.total_time = -np.inf

        self.triage_duration = -np.inf
        self.reg_duration = -np.inf
        self.exam_duration = -np.inf
        self.treat_duration = -np.inf

    def execute(self):
        """
        simulates the non-trauma/minor treatment process for a patient

        1. request and wait for sign-in/triage
        2. patient registration
        3. examination
        4. split patients
        4.1 % discharged
        4.2 % treatment then discharge
        """
        # record the time of arrival and entered the triage queue
        self.arrival = self.env.now

        # trace arrival
        self.trace(self.env.now, "arrival to centre.", self.identifier)

        # request sign-in/triage
        with self.triage.request() as req:
            yield req

            # record the waiting time for triage
            self.wait_triage = self.env.now - self.arrival

            # trace enter triage
            self.trace(
                self.env.now,
                f"entering triage. Waiting time: {self.wait_triage:.3f} mins",
                self.identifier,
            )

            # sample triage duration.
            self.triage_duration = self.args.triage_dist.sample()
            yield self.env.timeout(self.triage_duration)

            # trace exit triage
            self.trace(self.env.now, "triage complete", self.identifier)

        # record the time that entered the registration queue
        start_wait = self.env.now

        # request registration clert
        with self.args.registration.request() as req:
            yield req

            # record the waiting time for registration
            self.wait_reg = self.env.now - start_wait

            # trace begin registration
            self.trace(self.env.now, "starting patient registration.", self.identifier)

            # sample registration duration.
            self.reg_duration = self.args.reg_dist.sample()
            yield self.env.timeout(self.reg_duration)

            # registration complete...
            self.trace(
                self.env.now,
                f"patient registered;waiting time was {self.wait_triage:.3f}",
                self.identifier,
            )

        # record the time that entered the evaluation queue
        start_wait = self.env.now

        # request examination resource
        with self.args.exam.request() as req:
            yield req

            # record the waiting time for registration
            self.wait_exam = self.env.now - start_wait

            # trace begin examination
            self.trace(
                self.env.now,
                f"enter examination. Waiting time: {self.wait_exam:.3f}",
                self.identifier,
            )

            # sample examination duration.
            self.exam_duration = self.args.exam_dist.sample()
            yield self.env.timeout(self.exam_duration)

            # trace exit examination
            self.trace(self.env.now, "examination complete.", self.identifier)

        # sample if patient requires treatment?
        self.require_treat = self.args.nt_p_treat_dist.sample()

        if self.require_treat:
            # record the time that entered the treatment queue
            start_wait = self.env.now

            # request treatment cubicle
            with self.args.cubicle_1.request() as req:
                yield req

                # record the waiting time for treatment
                self.wait_treat = self.env.now - start_wait

                # trace enter treatment
                self.trace(
                    self.env.now,
                    f"enter treatment. Waiting time:{self.wait_treat:.3f}",
                    self.identifier,
                )

                # sample treatment duration.
                self.treat_duration = self.args.nt_treat_dist.sample()
                yield self.env.timeout(self.treat_duration)

                # trace exit treatment
                self.trace(self.env.now, "treatment complete ⛔", self.identifier)

        # total time in system
        self.total_time = self.env.now - self.arrival

    def _trace_config(self):
        """Override trace config"""
        config = {
            "name": "Non-trauma",
            "name_colour": "bold green",
            "time_colour": "bold green",
            "message_colour": "black",
            "tracked": self.args.tracked,
        }
        return config
```

## 7. Main model class

The main class that a user interacts with to run the model is `TreatmentCentreModel`.  This implements a `.run()` method, contains a simple algorithm for the non-stationary poission process for patients arrivals and inits instances of `TraumaPathway` or `NonTraumaPathway` depending on the arrival type.

```{python}
class TreatmentCentreModel:
    """
    The treatment centre model

    Patients arrive at random to a treatment centre, are triaged
    and then processed in either a trauma or non-trauma pathway.
    """

    def __init__(self, args):
        self.env = simpy.Environment()
        self.args = args
        self.init_resources()

        self.patients = []
        self.trauma_patients = []
        self.non_trauma_patients = []

        self.rc_period = None
        self.results = None

    def init_resources(self):
        """
        Init the number of resources
        and store in the arguments container object

        Resource list:
        1. Sign-in/triage bays
        2. registration clerks
        3. examination bays
        4. trauma bays
        5. non-trauma cubicles (1)
        6. trauma cubicles (2)

        """
        # sign/in triage
        self.args.triage = simpy.Resource(self.env, capacity=self.args.n_triage)

        # registration
        self.args.registration = simpy.Resource(self.env, capacity=self.args.n_reg)

        # examination
        self.args.exam = simpy.Resource(self.env, capacity=self.args.n_exam)

        # trauma
        self.args.trauma = simpy.Resource(self.env, capacity=self.args.n_trauma)

        # non-trauma treatment
        self.args.cubicle_1 = simpy.Resource(self.env, capacity=self.args.n_cubicles_1)

        # trauma treatment
        self.args.cubicle_2 = simpy.Resource(self.env, capacity=self.args.n_cubicles_2)

    def run(self, results_collection_period=DEFAULT_RESULTS_COLLECTION_PERIOD):
        """
        Conduct a single run of the model in its current
        configuration.

        Parameters:
        ----------
        results_collection_period, float, optional
            default = DEFAULT_RESULTS_COLLECTION_PERIOD

        warm_up, float, optional (default=0)
            length of initial transient period to truncate
            from results.

        Returns:
        --------
            None
        """
        # setup the arrival generator process
        self.env.process(self.arrivals_generator())

        # store rc period
        self.rc_period = results_collection_period

        # run
        self.env.run(until=results_collection_period)

    def arrivals_generator(self):
        """
        Simulate the arrival of patients to the model

        Patients either follow a TraumaPathway or
        NonTraumaPathway simpy process.

        Non stationary arrivals implemented via Thinning acceptance-rejection
        algorithm.
        """
        for patient_count in itertools.count():
            # sample iat via thinning
            interarrival_time = self.args.thinning_dist.sample(self.env.now)

            # delay until next arrival
            yield self.env.timeout(interarrival_time)

            # sample if the patient is trauma or non-trauma
            trauma = self.args.p_trauma_dist.sample()
            if trauma:
                # create and store a trauma patient to update KPIs.
                new_patient = TraumaPathway(patient_count, self.env, self.args)
                self.trauma_patients.append(new_patient)
            else:
                # create and store a non-trauma patient to update KPIs.
                new_patient = NonTraumaPathway(patient_count, self.env, self.args)
                self.non_trauma_patients.append(new_patient)

            # start the pathway process for the patient
            self.env.process(new_patient.execute())
```

### 8. Logic to process end of run results.

the class `SimulationSummary` accepts a `TraumaCentreModel`.  At the end of a run it can be used calculate mean queuing times and the percentage of the total run that a resource was in use.

```{python}
class SimulationSummary:
    """
    End of run result processing logic of the simulation model
    """

    def __init__(self, model):
        """
        Constructor

        Params:
        ------
        model: TraumaCentreModel
            The model.
        """
        self.model = model
        self.args = model.args
        self.results = None

    def process_run_results(self):
        """
        Calculates statistics at end of run.
        """
        self.results = {}
        # list of all patients
        patients = self.model.non_trauma_patients + self.model.trauma_patients

        # mean triage times (both types of patient)
        mean_triage_wait = self.get_mean_metric("wait_triage", patients)

        # triage utilisation (both types of patient)
        triage_util = self.get_resource_util(
            "triage_duration", self.args.n_triage, patients
        )

        # mean waiting time for registration (non_trauma)
        mean_reg_wait = self.get_mean_metric("wait_reg", self.model.non_trauma_patients)

        # registration utilisation (trauma)
        reg_util = self.get_resource_util(
            "reg_duration", self.args.n_reg, self.model.non_trauma_patients
        )

        # mean waiting time for examination (non_trauma)
        mean_wait_exam = self.get_mean_metric(
            "wait_exam", self.model.non_trauma_patients
        )

        # examination utilisation (non-trauma)
        exam_util = self.get_resource_util(
            "exam_duration", self.args.n_exam, self.model.non_trauma_patients
        )

        # mean waiting time for treatment (non-trauma)
        mean_treat_wait = self.get_mean_metric(
            "wait_treat", self.model.non_trauma_patients
        )

        # treatment utilisation (non_trauma)
        treat_util1 = self.get_resource_util(
            "treat_duration", self.args.n_cubicles_1, self.model.non_trauma_patients
        )

        # mean total time (non_trauma)
        mean_total = self.get_mean_metric("total_time", self.model.non_trauma_patients)

        # mean waiting time for trauma
        mean_trauma_wait = self.get_mean_metric(
            "wait_trauma", self.model.trauma_patients
        )

        # trauma utilisation (trauma)
        trauma_util = self.get_resource_util(
            "trauma_duration", self.args.n_trauma, self.model.trauma_patients
        )

        # mean waiting time for treatment (trauma)
        mean_treat_wait2 = self.get_mean_metric(
            "wait_treat", self.model.trauma_patients
        )

        # treatment utilisation (trauma)
        treat_util2 = self.get_resource_util(
            "treat_duration", self.args.n_cubicles_2, self.model.trauma_patients
        )

        # mean total time (trauma)
        mean_total2 = self.get_mean_metric("total_time", self.model.trauma_patients)

        self.results = {
            "00_arrivals": len(patients),
            "01a_triage_wait": mean_triage_wait,
            "01b_triage_util": triage_util,
            "02a_registration_wait": mean_reg_wait,
            "02b_registration_util": reg_util,
            "03a_examination_wait": mean_wait_exam,
            "03b_examination_util": exam_util,
            "04a_treatment_wait(non_trauma)": mean_treat_wait,
            "04b_treatment_util(non_trauma)": treat_util1,
            "05_total_time(non-trauma)": mean_total,
            "06a_trauma_wait": mean_trauma_wait,
            "06b_trauma_util": trauma_util,
            "07a_treatment_wait(trauma)": mean_treat_wait2,
            "07b_treatment_util(trauma)": treat_util2,
            "08_total_time(trauma)": mean_total2,
            "09_throughput": self.get_throughput(patients),
        }

    def get_mean_metric(self, metric, patients):
        """
        Calculate mean of the performance measure for the
        select cohort of patients,

        Only calculates metrics for patients where it has been
        measured.

        Params:
        -------
        metric: str
            The name of the metric e.g. 'wait_treat'

        patients: list
            A list of patients
        """
        mean = np.array(
            [getattr(p, metric) for p in patients if getattr(p, metric) > -np.inf]
        ).mean()
        return mean

    def get_resource_util(self, metric, n_resources, patients):
        """
        Calculate proportion of the results collection period
        where a resource was in use.

        Done by tracking the duration by patient.

        Only calculates metrics for patients where it has been
        measured.

        Params:
        -------
        metric: str
            The name of the metric e.g. 'treatment_duration'

        patients: list
            A list of patients
        """
        total = np.array(
            [getattr(p, metric) for p in patients if getattr(p, metric) > -np.inf]
        ).sum()

        return total / (self.model.rc_period * n_resources)

    def get_throughput(self, patients):
        """
        Returns the total number of patients that have successfully
        been processed and discharged in the treatment centre
        (they have a total time record)

        Params:
        -------
        patients: list
            list of all patient objects simulated.

        Returns:
        ------
        float
        """
        return len([p for p in patients if p.total_time > -np.inf])

    def summary_frame(self):
        """
        Returns run results as a pandas.DataFrame

        Returns:
        -------
        pd.DataFrame
        """
        # append to results df
        if self.results is None:
            self.process_run_results()

        df = pd.DataFrame({"1": self.results})
        df = df.T
        df.index.name = "rep"
        return df
```

## 9. Model execution

We note that there are **many ways** to setup a `simpy` model and execute it (that is part of its fantastic flexibility).  The organisation of code we show below is based on our experience of using the package in practice. The approach also allows for easy parallisation over multiple CPU cores using `joblib`.

We include two functions. `single_run()` and `multiple_replications`.  The latter is used to repeatedly call and process the results from `single_run`.

```{python}
def single_run(
    scenario, rc_period=DEFAULT_RESULTS_COLLECTION_PERIOD, random_no_set=DEFAULT_RNG_SET
):
    """
    Perform a single run of the model and return the results

    Parameters:
    -----------

    scenario: Scenario object
        The scenario/paramaters to run

    rc_period: int
        The length of the simulation run that collects results

    random_no_set: int or None, optional (default=DEFAULT_RNG_SET)
        Controls the set of random seeds used by the stochastic parts of the
        model.  Set to different ints to get different results.  Set to None
        for a random set of seeds.

    Returns:
    --------
        pandas.DataFrame:
        results from single run.
    """

    # set random number set - this controls sampling for the run.
    scenario.set_random_no_set(random_no_set)

    # create an instance of the model
    model = TreatmentCentreModel(scenario)

    # run the model
    model.run(results_collection_period=rc_period)

    # run results
    summary = SimulationSummary(model)
    summary_df = summary.summary_frame()

    return summary_df
```

```{python}
def multiple_replications(
    scenario, rc_period=DEFAULT_RESULTS_COLLECTION_PERIOD, n_reps=5
):
    """
    Perform multiple replications of the model.

    Params:
    ------
    scenario: Scenario
        Parameters/arguments to configurethe model

    rc_period: float, optional (default=DEFAULT_RESULTS_COLLECTION_PERIOD)
        results collection period.
        the number of minutes to run the model to collect results

    n_reps: int, optional (default=DEFAULT_N_REPS)
        Number of independent replications to run.

    Returns:
    --------
    pandas.DataFrame
    """

    results = [
        single_run(scenario, rc_period, random_no_set=rep) for rep in range(n_reps)
    ]

    # format and return results in a dataframe
    df_results = pd.concat(results)
    df_results.index = np.arange(1, len(df_results) + 1)
    df_results.index.name = "rep"
    return df_results
```

### 9.1 Single run of the model

The script below performs a single replication of the simulation model.  

**Try:**

* Changing the `random_no_set` of the `single_run` call.
* Assigning the value `True` to `debug`
* Tracking patient arrivals 10 and 19 through the model by setting `tracked=[10, 19]`

```{python}
# create the default scenario.
args = Scenario(debug=True, tracked=[10, 19])

# use the single_run() func
# try changing `random_no_set` to see different run results
print("Running simulation ...", end=" => ")
results = single_run(args, random_no_set=42)
print("simulation complete.")

# show results (transpose replication for easier view)
print(results.T)
```

### 9.2 Disabling trace for multiple independent replications

**Important**: set `debug=False` in `Scenario` for multiple replications. This will ensure that each replication is not traced.

```{python}
args = Scenario(debug=False)

# run multiple replications.
results = multiple_replications(args, n_reps=50)

print("All replications complete.")

# summarise the results (2.dp)
results.mean().round(2)
```


### Using the example `treat-sim` model

For examples in this section, we'll use the `treat-sim` simulation model. This is an example model of a terminating system based on exercise 13 from page 170 of:

> Nelson. B.L. (2013). Foundations and methods of stochastic simulation. Springer.

## Installing the model

The model is available on [GitHub](https://github.com/pythonhealthdatascience/stars-treat-sim), and can be installed from [PyPI](https://pypi.org/project/treat-sim/):

```bash
pip install treat-sim
```


```{python}
# pylint: disable=missing-module-docstring
from treat_sim.model import Scenario, single_run, multiple_replications
```

## Running a single replication

1. **Create an instance of `Scenario`**. This contains all the parameters used in the model (e.g. resources and statistical distributions for arrivals and activities).
2. **Run the model once with `single_run()`**. This function is passed an instance of `Scenario` and an integer to set the random numbers used.

This returns a `pandas.DataFrame` with all performance measures for that replication, e.g. waiting times, utilisation, throughput.

```{python}
scenario = Scenario()

# Run a single replication
results = single_run(scenario, random_no_set=1)
print(results.T)
```

You can extract a single metric like this:

```{python}
print(results["01a_triage_wait"].iloc[0])
```

## Running multiple replications

Running several replications at once is very similar to running a single replication — you just use the `multiple_replications()` function instead. You'll need to provide an instance of `Scenario`, along with the number of replications you want to run.

```{python}
scenario = Scenario()

# Run multiple replications
results = multiple_replications(scenario, n_reps=20)
results.head()
```


### Choosing the number of replications

When running a simulation, you need to decide how many replications (runs) are enough. The **confidence interval method** can be used to help guide this choice.

**Note:** The examples below use the `treat-sim` model. If you haven't run it before, see [Using the example `treat-sim` model](01-treatsim.qmd) for set-up and basic usage.


## Imports

```{python}
# pylint: disable=missing-module-docstring
from treat_sim.model import Scenario, multiple_replications
from sim_tools.output_analysis import (
    confidence_interval_method,
    plotly_confidence_interval_method,
)
```

## Confidence interval method

In this method, you first **run the simulation** for a set number of replications. Due to stochasticity, each will produce slightly different averages for each performance metric.

Once the runs are complete, you go step-by-step (for the first run, then the first two runs, the first three runs, and so on) calculating:

* **Cumulative mean**
* **Confidence interval** around that mean

As the number of replications included increases, you'll typically see the interval narrows. The required number of replications is the point where you feel results are **stable** - i.e. that doing more replications is unlikely to change your conclusions in a meaningful way.

You can decided this by setting a **desired precision**. Precision here means the **percentage deviation** of the confidence interval's half-width from the mean. For example, if the precision is set to `0.1`, it will identify the point where the half-width of the confidence interval is less than or equal to 10% of the mean.

## Example: Single performance metric

The function returns a **tuple** consisting of:

1. The minimum number of replications to achieve the desired precision.
2. A detailed DataFrame of statistics for each stage.

```{python}
scenario = Scenario()
rep_results = multiple_replications(scenario, n_reps=150)

confint_result = confidence_interval_method(
    replications=rep_results["01a_triage_wait"], desired_precision=0.1
)

# View results
print(confint_result[0])
confint_result[1].head()
```

## Visualise results

You can plot how the confidence interval narrows as you add more runs using the `plotly_confidence_interval_method` function.

```{python}
plotly_confidence_interval_method(
    n_reps=confint_result[0], conf_ints=confint_result[1], metric_name="01a_triage_wait"
)
```

## Running on multiple performance metrics

You can check several outcomes at once. Just pass multiple columns to `confidence_interval_method`.

This will return a dictionary, with a tuple for each metric.

```{python}
confint_multiple = confidence_interval_method(
    replications=rep_results[
        ["01a_triage_wait", "01b_triage_util", "02a_registration_wait"]
    ],
    desired_precision=0.1,
)

# View output dictionary keys
print(confint_multiple.keys())
```

```{python}
# View results from one of the metrics
print(confint_multiple["02a_registration_wait"][0])
confint_multiple["02a_registration_wait"][1].head()
```


### Automating the selection of the number of replications

The `sim-tools` package includes an implementation of the "replications algorithm" from [Hoad, Robinson, & Davies (2010)](https://www.jstor.org/stable/40926090), used to automatically determine the number of simulation replications needed to achieve a user-specified confidence interval (CI) precision. It works by combining:

* **The confidence interval method** (described on [the previous page](02-confint_method.qmd)), which estimates the required replications for a target CI half-width.
* **A sequential look-ahead procedure** to check whether the desired CI precision is still met as additional replications are run.

If you make use of this code in your work **please cite the `sim-tools` as well as the original authors article**:

> Hoad, Robinson, & Davies (2010). Automated selection of the number of replications for a discrete-event simulation. *Journal of the Operational Research Society*. [https://www.jstor.org/stable/40926090](https://www.jstor.org/stable/40926090)

**Note:** The examples below use the `treat-sim` model. If you haven't run it before, see [Using the example `treat-sim` model](01-treatsim.qmd) for set-up and basic usage.


## Imports

```{python}
# pylint: disable=missing-module-docstring,invalid-name

from collections.abc import Callable
from typing import Optional

from treat_sim.model import Scenario, single_run
from sim_tools.output_analysis import (
    ReplicationsAlgorithm,
    ReplicationsAlgorithmModelAdapter,
    ReplicationTabulizer,
    plotly_confidence_interval_method,
)
```

##  Replications Algorithm overview

The look ahead component is governed by the formula below. A user specifies a look ahead period called $k_{limit}$. If $n \leq 100$ then a parameter $k_{limit}$ is used.  Hoad et al.(2010) recommend a value of 5 based on extensive empirical work. If $n \geq 100$ then a fraction is returned.  The notation $\lfloor$ and $\rfloor$ mean that the function must return an integer value (as it represents the number of replications to look ahead).   

$$
f(k_{limit}, n) = \left\lfloor \dfrac{k_{limit}}{100}\max(n, 100)\right\rfloor 
$$

In general $f(k_{limit}, n)$ means that as the number of replications increases so does the look ahead period.  For example when you have run 50 replications a $k_{limit} = 5$ means you simulate an extra 5 replications.  When you have run 800 replications, a $k_{limit} = 5$ means the number of extra replications you simulate is:

$$
\left\lfloor \dfrac{5}{100} \times 800\right\rfloor = 40
$$

The paper provides a formal algorithm notation. Here we provide a pictorial representation of the general logic of the replications algorithm.

![](../assets/replications_algorithm.png)

## Observers

The `ReplicationAlgorithm` requires an observer. This is an object whose job it is to watch each replication in your algorithm and record relevant statistics as the simulation runs.

A suitable observer is supplied by default - `ReplicationTabulizer` - but you can also create your own, but it needs to follow this protocol:

```{.python}
@runtime_checkable
class AlgorithmObserver(Protocol):
    """
    Protocol for observer classes used in ReplicationsAlgorithm.

    Classes implementing this protocol should provide a `dev` attribute
    to store observations, a method `update` to add new results, and a
    `summary_table` method to summarize the stored replication statistics.

    Attributes
    ----------
    dev : List[Any]
        Collection of observed replication results.

    Methods
    -------
    update(results) -> None
        Add an observation of a replication.

    summary_table() -> pd.DataFrame
        Create a DataFrame summarising all recorded replication statistics.
    """

    dev: List[Any]

    def update(self, results) -> None:
        ...

    def summary_table(self) -> pd.DataFrame:
        ...
```

<div class="alert alert-block alert-info">

**What does this mean?**

`AlgorithmObserver` essentially just tells us that:

* There should be a `.dev` attribute containing the percentage deviation.
* There should be an `update()` method which updates statistics after each replication.
* There should be a `summary_table()` method which returns a dataframe summarising the replication results.

</div>

## Adapting a simulation model to work with `ReplicationsAlgorithm`

### Why an adapter is needed

Every simulation model has it's own way of:

* Being initialised (parameters, scenario set-up)
* Running a single replication (function/method name, arguments)
* Returning results (scalar, dict, `DataFrame`, etc.)

The `ReplicationsAlgorithm` needs to be able to **call any model in a consistent way**, regardless of those differences. To do that, we need to wrap models in an **Adapter** - a "wrapper" object exposing a **fixed interface** the algorithm can use.

### Required interface

We formalise what the adapter must look like with a **Protocol** called `ReplicationsAlgorithmModelAdapter`.

```{.python}
@runtime_checkable
class ReplicationsAlgorithmModelAdapter(Protocol):
    """
    Adapter pattern for the "Replications Algorithm".

    All models that use ReplicationsAlgorithm must provide a
    single_run(replication_number) interface.
    """

    def single_run(self, replication_number: int) -> dict[str, float]:
        """
        Run a single unique replication of the model and return results.

        Parameters
        ----------
        replication_number : int
            The replication sequence number.
        
        Returns
        -------
        dict[str, float]
            {'metric_name': value, ... } for all metrics being tracked.
        """
        pass
```

<div class="alert alert-block alert-info">

**What does this mean?**

`ReplicationsAlgorithmModelAdapter` essentially just tells us that:

* There should be a `single_run()` method which is passed the `replication_number` to seed it.
* It should return a **dictionary** mapping metric name to float value for *every* requested metric, after just one replication.

</div>

### Adapter for `treat-sim`

Below is an adapter for the `treat-sim` model.

```{python}
# pylint: disable=too-few-public-methods
class TreatSimAdapter:
    """
    treat-sim multi-metric single_run interface adapter for the
    ReplicationsAlgorithm.
    """

    def __init__(
        self,
        scenario: Scenario,
        metrics: list[str],
        single_run_func: Callable,
        run_length: Optional[float] = 1140.0,
    ):
        """
        Parameters:
        ------------
        scenario: Scenario
            The experiment and parameter class for treat-sim

        metrics: list[str]
            The metric(s) to output

        single_run_func: Callable
            The single run function that is being adapted.
            This must return something indexable by metric name.

        run_length: float, optional (default = 1140)
            The run_length for the model
        """
        self.scenario = scenario
        self.metrics = metrics
        self.single_run_func = single_run_func
        self.run_length = run_length

    def single_run(self, replication_number: int) -> dict[str, float]:
        """
        Conduct single run of the simulation.
        Returns a dictionary of {metric_name: value}.
        """
        results_df = self.single_run_func(
            self.scenario,
            rc_period=self.run_length,
            random_no_set=replication_number,
        )
        return {metric: results_df[metric].iloc[0] for metric in self.metrics}
```

Note that `TreatSimAdapter` **does what is says on the tin**: it adapts the model so that its implements the expected interface. I.e. 

```{python}
METRICS = ["01a_triage_wait", "01b_triage_util", "02a_registration_wait"]

ts_model = TreatSimAdapter(
    scenario=Scenario(), metrics=METRICS, single_run_func=single_run
)

ts_model.single_run(1)
```

The Protocol `ReplicationsAlgorithmModelAdapter` is marked as `@runtime_checkable`, which allows us to check at runtime that any supplied model conforms to the expected interface.

```{python}
print(isinstance(ts_model, ReplicationsAlgorithmModelAdapter))
```

## Example usage

### Running `ReplicationsAlgorithm` on a single metric

```{python}
# Set up model with adapter
ts_model = TreatSimAdapter(
    scenario=Scenario(), metrics=["01a_triage_wait"], single_run_func=single_run
)

# Run algorithm
analyser = ReplicationsAlgorithm(observer_factory=ReplicationTabulizer)
nreps, summary_frame = analyser.select(model=ts_model, metrics=["01a_triage_wait"])

# Preview results
print(nreps)
summary_frame.head()
```

You can visualise the tabulized results using the `plotly_confidence_interval_method` function.

```{python}
plotly_confidence_interval_method(
    n_reps=nreps["01a_triage_wait"],
    conf_ints=summary_frame[summary_frame["metric"] == "01a_triage_wait"],
    metric_name="01a_triage_wait",
)
```

### Running `ReplicationsAlgorithm` on a multiple metrics

```{python}
METRICS = ["01a_triage_wait", "01b_triage_util", "02a_registration_wait"]

# Set up model with adapter
ts_model = TreatSimAdapter(
    scenario=Scenario(), metrics=METRICS, single_run_func=single_run
)

# Run algorithm
analyser = ReplicationsAlgorithm(observer_factory=ReplicationTabulizer)
nreps_m, summary_frame_m = analyser.select(model=ts_model, metrics=METRICS)

# Preview results
print(nreps_m)
summary_frame_m.head()
```

```{python}
for metric in METRICS:
    mask = summary_frame_m["metric"] == metric
    fig = plotly_confidence_interval_method(
        n_reps=nreps_m[metric],
        conf_ints=summary_frame_m[mask].reset_index(),
        metric_name=metric,
    )
    fig.show()
```


# How to choose the best setup for a system.
## A tutorial for the Simulation Workshop 2021

**Dr Christine S.M Currie and Tom Monks**

This tutorial introduces Ranking and Selection procedures for stochastic computer simulation.  We focus on in difference zone and optimal computer budget allocation procedures. The procedures are:

* KN and KN++
* OCBA and OCBA-m

> The code implementing the Ranking and Selection procedures explored in this tutorial is available online with an MIT license https://github.com/TomMonks/sim-tools


```{python}
# pylint: disable=missing-module-docstring,invalid-name

# run this first to install the OvS procedures so we can use them in the
# tutorial.
# !pip install sim-tools
```

```{python}
import numpy as np
```

```{python}
from sim_tools.ovs.evaluation import Experiment
from sim_tools.ovs.fixed_budget import OCBA, OCBAM
from sim_tools.ovs.indifference_zone import KN, KNPlusPlus
from sim_tools.ovs.toy_models import (
    custom_gaussian_model,
    gaussian_sequence_model,
    random_gaussian_model,
    ManualOptimiser,
)
```

## Simulation Models

To keep the focus of this tutorial on Ranking and Selection procedures and what they do, and how they perform in practice, the models we introduce are deliberately simple.   The following sections describe how to create a a set of simulation models where the outputs are independent guassian distributions.  There are three options:

* the means of these outputs distributions follow a sequence (e.g. 1, 2, 3, 4, 5) variance is 1.0. 
* the means and variances are user specified.
* the means and variances are randomly generated.

### Creating a simple sequence of normal distributions

A simple model can be created using the `guassian_sequance_model()` function.  This creates a simulation model where the output of each design follows a $N - (\mu_{i}, 1.0)$ :

The function accepts three keyword arguments:

* **start** - int, the first mean in the sequence (inclusive)
* **end** - int, the last mean int the sequence (inclusive)
* **step** - int, the difference between mean i and mean i + 1.

For example, the following code creates a simulation model with 10 designs with means 1 to 10 and unit variance.

```{.python}
model = guassian_sequence_model(1, 10)
```

To create a simulation model with 5 designs where $\mu_{i+1} - \mu_i = 2 $ :

```{.python}
model = guassian_sequence_model(1, 10, step=2)
```

### Creating a custom model with known designs
Instead of a sequence of normal distributions with unit variance, it is possible to create a custom set of designs with varying variances. Use the `custom_guassian_model` function for this task.  For example to create a custom set of designs:

```{.python}
means = [5, 8, 1, 2, 1, 7]
variances = [0.1, 1.2, 1.4, 0.3, 0.8]

custom_model = custom_guassian_model(means, variances)
```

The following code demonstrates how to create a sequence of 100 designs with variances that are 10% of the mean. 

```{.python}
n_designs = 100
means = [i for i in range(n_designs+1)]
variances = [j*0.1 for j in range(n_designs+1)]

custom_model = custom_guassian_model(means, variances)
```


### Creating a model with randomly sampled designs

To create a model with a set of unknown designs (within a specified mean and variance tolerance) use 

```{.python}
mean_low = 1.0
mean_high = 15.0
var_low = 0.1
var_high = 2.0
n_designs = 15

model = random_guassian_model(mean_low, mean_high, var_low, var_high, n_designs)
```

Where:
* **mean_low** - float, a lower bound on the means of the output distributions
* **mean_high** - float, an upper bound on the means
* **var_low** - float, a lower bound on the variance of the output distributions
* **var_high** - float, an upper bound on the variances.
* **n_designs** - int, the number of designs to create.

---

## A manual optimisation framework

Before using the **Optimisation via Simulation** (OvS) procedures, it is recommended that you get a feel for the framework in which the OvS procedures operate.  To do this we will create some models and explore them using a `ManualOptimiser`.  This allows the user to run independent and multiple replications of the model yourself independent of any algorithm.  The `ManualOptimiser` keeps track of the means, variances and number of replications run for each design.

A `ManualOptimiser` object requires two parameters when it is created.  

* model - object, e.g. a model that is a sequence of normal distributions
* n_designs - the number of designs to be considered.

```{python}
manual_opt = ManualOptimiser(model=gaussian_sequence_model(1, 10), n_designs=10)
```

* We can print the optimiser object to help us remember what parameters we set.

```{python}
print(manual_opt)
```

* Follow Law and Kelton's advice and run 3 initial replications of each design.

```{python}
manual_opt.simulate_designs(replications=3)
```

The optimiser keeps track of the allocation of replications across each design.  For efficiency it doesn't store each individual observation, but it does compute a running mean and variance.

* Let's have a look at the replication allocation between and results of each design.

```{python}
print(manual_opt.allocations)
```

```{python}
np.set_printoptions(precision=1)  # this is a hack to view at 1 decimal place.
print(manual_opt.means)
```

```{python}
print(manual_opt.vars)
```

Now let's run 1 additional replication of the top 3 designs. 
Note - in python arrays are **zero indexed**.  This means that design 1 has index 0.

```{python}
manual_opt.simulate_designs(design_indexes=[7, 8, 9], replications=1)
```

```{python}
print(manual_opt.allocations)
```

```{python}
print(manual_opt.means)
```

```{python}
print(manual_opt.vars)
```

### Manual Optimisation of a model with unknown means.

**Now have a go yourself**.  This time create a model with random designs.  Run as many replications of each design as you think is neccessary to make a decision about which design is best.  Here we define best as the design with the largest mean value.

```{python}
# create random problem
rnd_model = random_gaussian_model(
    mean_low=5, mean_high=25, var_low=0.5, var_high=2.5, n_designs=10
)

# create manual optimiser for you to use.
manual_opt = ManualOptimiser(model=rnd_model, n_designs=10)
```

```{python}
# insert your code here...
```

---
# Indifference Zone Ranking and Selection

## Procedure **KN**
The first Ranking and Selection (R&S) algorithm we will explore is **Procedure KN** developed by Kim and Nelson. This is an *Indifference Zone* (IZ) approach to R&S.  IZ procedures exploit the the idea that the performance of one or more of the sub-optimal designs are in fact so close to the best performing system that a user does not care which one is chosen (the decision maker is said to be *indifferent* to this choice).  To do this a user must specify a quantity $\delta$.  Procedure KN provides a theorectical guarantee to select the best system or a system within $\delta$ of the best with probability of $1 - \alpha$.

A key feature of KN is that it only estimates the variance of each design once.  This happens after an initial number of replications (specified by the user) $n_0$.  

To run Kim and Nelson's R&S procedure KN, create an instance of `ovs.indifference_zone.KN`

An object of type KN takes the following parameters:

* **model** - a simulation model
* **n_designs** - int, the number of competing designs to compare
* **delta** - float, the indifference zone
* **alpha** - float, $PCS = 1-\alpha$ (default=0.05)
* **n_0** - int, $n_0$ the number of initial replications (default=2)

```
from sim_tools.ovs.indifference_zone import KN, KNPlusPlus
```

### Example 1.  A simple sequence of normal distributions.

The first problem we will test KN against is selecting the largest mean from a sequence of 10 normal distributions (means raning from 1 to 10) with unit variance.  We want a $PCS = 0.95$ and are indifferent to designs that are within an average of 1.0 of the best.  

For this simple problem, we will follow Law's longstanding advice of setting $n_0 = 5$ i.e 5 initial replications of each design.

#### Setting up KN

```{python}
N_DESIGNS = 10
DELTA = 1.0
ALPHA = 0.05
N_0 = 5

# first we create the simulation model.
model = gaussian_sequence_model(1, N_DESIGNS)

# then we create the KN R&S object and pass in our parameters.
# note that this doesn't run the optimisation yet.
kn = KN(model=model, n_designs=N_DESIGNS, delta=DELTA, alpha=ALPHA, n_0=N_0)
```

```{python}
# quickly check if KN is parameterised as expected
print(kn)
```

#### Running KN
Now that we have created the KN object we can run the solver at any time we choose.  To do this call the `KN.solve()` method.  This method returns a design within the indifference zone $(1 - alpha$) of the time.

```{python}
# this runs KN (you can run this cell multiple times if you wish)
best_design = kn.solve()

# print out the results
print(f"best design\t{best_design}")
print(f"allocations\t{kn._allocations}")
print(f"total reps\t{kn._allocations.sum()}")
```

In general you should see that KN simulates the top 2-3 designs the most with the lower performing designs eliminated early on (possibly after the initial stage of replication).

## Procedure **KN++**

Kim and Nelson's KN++ procedure is an enhancement on KN.  KN++ introduces an **update** step that recalculates the variance of the differences between designs.  

To run procedure KN++, create an instance of `ovs.indifference_zone.KNPlusPlus`

### Example 2. Solving the simple problem using KN++

We will run KN++ on the same simulation model as KN.  In general you should see that KN++ uses less replication to find a design within the indifference zone.

```{python}
knpp = KNPlusPlus(model=model, n_designs=N_DESIGNS, delta=DELTA, alpha=ALPHA, n_0=N_0)

best_design = knpp.solve()
print(f"best design\t{best_design}")
print(f"allocations\t{knpp._allocations}")
print(f"total reps\t{knpp._allocations.sum()}")
```

## Exercise 1. Comparing KN and KN++ on a larger design.

Your task is to compare KN and KN++ optimising a simulation model with 100 competing designs.  These designs are a sequence of 100 normal distributions with unit variance.  The objective is to find the design with the maximum mean.

What do you notice about the replications required from each method?

```{python}
# the code to set up the simulation model has been provided below.
N_DESIGNS = 100
DELTA = 1.0
ALPHA = 0.05
N_0 = 5

# first we create the simulation model.
model = gaussian_sequence_model(1, N_DESIGNS)
```

```{python}
# Your code goes here...

# Hint 1. you need to create KN and KNPlusPlus objects, pass in the simulation
# model and run them.

# Hint 2. try running the models a few times.

```

## Exercise 2: A more difficult optimisation problem

We will now explore how the IZ procedures perform on a slightly harder optimisation problem where the mean of the designs are closer together and have differing variances.

The problem is: 

```{.python}
means = [4, 4.1, 4.2, 4, 4.1, 4.3, 4, 4.1, 4.2, 4.2]
variances = [1, 1, 1, 0.1, 0.1, 10, 10, 10, 10, 0.1]
```

The 'optimal' mean design is at index 5 (4.3). It is a very noisy design.   We will set a $\delta$ so that there are multiple designs within the IZ.

Your task is to compare the performance of KN and KN++ given the following parameters:

```{.python}
N_DESIGNS = 10
N_0 = 10
DELTA = 0.15
ALPHA = 0.05
```

We are also setting a random seed before running each procedure.  This ensures that the first stage of replications returns the same samples.

Questions: 
* Do the methods return the 'optimal' design or another design within the indifference zone?
* Which procedure require the most replication effort?  Does it vary?
* Are there any differences in designs selected and the number of replications needed?  Try different values of `SEED`.
* What happens if you try difference values of $n_0$ e.g. 20 and 50?
* What happens if you change $\alpha$ to 0.1? 


<div class="alert alert-block alert-info">
<b>RUNTIME WARNING!</b> This problem requires many more replications.  It will take a few seconds for the procedures to return a design.</div>

```{python}
# set problem parameters and create simulation model.

N_DESIGNS = 10
N_0 = 20
DELTA = 0.15
ALPHA = 0.05

# change the value of seed.
SEED = 999

means = [4, 4.1, 4.2, 4, 4.1, 4.3, 4, 4.1, 4.2, 4.2]
variances = [1, 1, 1, 0.1, 0.1, 10, 10, 10, 10, 0.1]

guass_model = custom_gaussian_model(means, variances)

# print out which design indexes are in the indifference zone
print(np.where(4.3 - np.array(means) <= DELTA)[0])
```

```{python}
# create an instance of KN and solve.

kn = KN(model=guass_model, n_designs=N_DESIGNS, delta=DELTA, alpha=ALPHA, n_0=N_0)

np.random.seed(SEED)
best_design = kn.solve()

# print out the results
print(f"best design\t{best_design}")
print(f"allocations\t{kn._allocations}")
print(f"total reps\t{kn._allocations.sum()}")
```

```{python}
# use KN++ on the same problem
knpp = KNPlusPlus(
    model=guass_model, n_designs=N_DESIGNS, delta=DELTA, alpha=ALPHA, n_0=N_0
)

# we use the same random seed to help with comparison.
np.random.seed(SEED)
best_design = knpp.solve()

# print out the results
print(f"best design\t{best_design}")
print(f"allocations\t{knpp._allocations}")
print(f"total reps\t{knpp._allocations.sum()}")
```

# Optimal Computing Budget Allocation (OCBA)

We will first consider the scenario where the objective is find the optimal design.  The distinguishing feature of OCBA procedures from IZ procedures a fixed budget formulation of the optimisation problem.  I.e. achieve the best $PCS$ with the budget available.

Single solution OCBA object can be created by importing `ovs.fixed_budget.OCBA`

An object of type `OCBA` takes the following parameters:

* **model** - a simulation model
* **n_designs** - int, the number of competing designs to compare
* **budget** - int, the total number of replications to allocate across designs
* **delta** - int, the incremental amount of replications to allocate at each round
* **n_0** - int, $n_0$ the number of initial replications (default=5)
* **obj** - str, 'min' if minimisation; 'max' if maximisation (default='min')

```
from sim_tools.ovs.fixed_budget import OCBA, OCBAM
```

## Example 3. Applying OCBA to the simple problem.

We will first apply OCBA to the simple sequence of normal random variables that we used in Example 1.  

Initially we will set a simulation budget of 1000 replications. This is overkill, but makes it easier to see what OCBA is doing.

```{python}
N_DESIGNS = 10

# remember that delta is not the same as in IZ!
# it is the amount of replication to allocate at each stage.
DELTA = 5

# the initial stage will run 5 reps of each design.
N_0 = 5

BUDGET = 1000

# first we create the simulation model.
model = gaussian_sequence_model(1, N_DESIGNS)

ocba = OCBA(
    model=model, n_designs=N_DESIGNS, budget=BUDGET, delta=DELTA, n_0=N_0, obj="max"
)
```

```{python}
print(ocba)
```

call the `solve()` method to run the optimisation.  If needed, run OCBA a few times.

```{python}
best_design = ocba.solve()

# print out the results
print(f"best design\t{best_design}")
print(f"allocations\t{ocba._allocations}")
print(f"total reps\t{ocba._allocations.sum()}")
```

* Run the cell above a few times.  You should see that OCBA is allocating most of the budget across the top 3 designs.

## Exercise 3. Exploring OCBA and $PCS$

OCBA conceptualises optimisation as a fixed budget problem.  Let's run an experiment where we estimate what $PCS$ is actually being achieved under a given budget.  KN++ was reporting a range of 70-80 replications.  Let's try 75.

To help with this test you can import an Experiment object: `ovs.evaluation.Experiment`

An `Experiment` object takes the following parameters:

* **env** - the simulated environment aka the model.
* **procedure** - the method to test e.g. KNPlusPlus or OCBA
* **best_index** - the index of the best design in the simulation model (zero based).
* **objective** - 'max' or 'min'
* **replications** - the number of repeats of the experiment to run.

**Your task is to run the experiment with different budgets.  E.g. 60, 70 and 75.  Remember that you cannot go lower than N_0 + DELTA**

```
from sim_tools.ovs.evaluation import Experiment
```


<div class="alert alert-block alert-info">
<b>RUNTIME WARNING!</b> The experiment will will take a few seconds to complete. </div>

```{python}
# this is the budget you can vary.
BUDGET = 75

# you could also vary delta.  What happens if delta = 1?
DELTA = 5

N_DESIGNS = 10
N_0 = 5

ocba = OCBA(
    model=model, n_designs=N_DESIGNS, budget=BUDGET, delta=DELTA, n_0=N_0, obj="max"
)

# create an experiment and repeat 1000 times.
exp = Experiment(
    env=model, procedure=ocba, best_index=9, objective="max", replications=1000
)

results = exp.execute()

# print the probability of correct selection achieved.
print(f"PCS={results.p_correct_selections}")
```

## Exercise 4. Calculating the Expected Opportunity Cost.

Another way to evaluate a procedures performance is to calculate the Expected Opportunity Cost.  This is the average difference between the best mean and the selected mean in each repeat of the experiment.

**Task: try varying the budget, delta and n_0.  What is the effect?**

We will only run a small number of repeats of the experiments to save runtime!

```{python}
# KN++ needed a large budget to reach its desired precision.
# We will try a smaller budgets here
BUDGET = 20000
DELTA = 5
N_DESIGNS = 10
N_0 = 5

means = [4, 4.1, 4.2, 4, 4.1, 4.3, 4, 4.1, 4.2, 4.2]
variances = [1, 1, 1, 0.1, 0.1, 10, 10, 10, 10, 0.1]

guass_model = custom_gaussian_model(means, variances)

ocba = OCBA(
    model=guass_model, n_designs=N_DESIGNS, budget=BUDGET, delta=5, n_0=50, obj="max"
)

# create an experiment and repeat 10 times (in practice we would use more!)
exp = Experiment(
    env=guass_model, procedure=ocba, best_index=5, objective="max", replications=10
)

results = exp.execute()

# the probability of correct selection achieved.
print(f"PCS={results.p_correct_selections}")

# IF YOU REMEMBER FROM Example 2 THE DESIGNS IN THE IZ WERE [2 5 8 9]
# Let's print out the selections made...
print(f"selected designs: {results.selections}")

# the average of the best_mean - selected_design_mean for each rep of the exp.
print(f"EOC={results.expected_opportunity_cost}")
```

## Optimal Computing Budget Allocation Top M (OCBA-m)

In many situations it is not possible or desirable to model all of the detail in a simulation model.   A procedure that offers a top performing group of designs is more useful in these circumstances.  It allows a decision maker to choose between good designs while taking other factors into account.

So far we have only considered selecting the best system.  OCBA-m extended OCBA to identify the **top m designs**.  

A top-m solution OCBA object can be created by importing `ovs.fixed_budget.OCBAM`

An object of type `OCBAM` takes the following parameters:

* **model** - a simulation model
* **n_designs** - int, the number of competing designs to compare
* **budget** - int, the total number of replications to allocate across designs
* **delta** - int, the incremental amount of replications to allocate at each round
* **n_0** - int, $n_0$ the number of initial replications (default=5)
* **m** - int, $m$ the number of top designs to return.  m must be >=2.  For m = 1 see OCBA.
* **obj** - str, 'min' if minimisation; 'max' if maximisation (default='min')

## Example 4. Selecting the top 2 designs.

```{python}
N_DESIGNS = 10
BUDGET = 1000
DELTA = 5
N_0 = 5

# you can vary this parameter to change the number selected.
M = 3

guass_model = gaussian_sequence_model(1, N_DESIGNS)

ocbam = OCBAM(
    model=guass_model,
    n_designs=N_DESIGNS,
    budget=BUDGET,
    delta=DELTA,
    n_0=N_0,
    m=M,
    obj="max",
)
```

```{python}
print(ocbam)
```

```{python}
# print out the results
best_designs = ocbam.solve()
print(f"best designs\t{best_designs}")
print(f"allocations\t{ocbam._allocations}")
print(f"total reps\t{ocbam._allocations.sum()}")
```

## Exercise 5. Select the top 3 designs out of 100 designs.

Task: Use OCBA-m to select the **smallest** 3 designs from the example problem.  What impact does varying the budget, delta and $n_0$ have on performance?

```{python}
N_DESIGNS = 100
BUDGET = 1000
DELTA = 1
N_0 = 5

# you can vary this parameter to change the number selected.
M = 3

guass_model = gaussian_sequence_model(1, N_DESIGNS)


# insert your code here
# ocbam = ?
```

```{python}
# print out the results
best_designs = ocbam.solve()
print(f"best designs\t{best_designs}")
print(f"allocations\t{ocbam._allocations}")
print(f"total reps\t{ocbam._allocations.sum()}")
```

# THE END

What did you think of the procedures?