Managed Data Model

Identifiers

Protocols and typing tools for identifying SCALE-MS objects and interfaces.

TODO: Publish some schema to scalems.org for easy reference and verifiable UUID hashing.

class scalems.identifiers.EphemeralIdentifier[source]

Process-scoped, UUID-based identifier.

Not reproducible. Useful for tracking objects within a single process scope.

__init__(node=None, clock_seq=None)[source]
bytes()[source]

A consistent bytes representation of the identity.

The core interface provided by Identifiers.

Note that the identity (and the value returned by self.bytes()) must be immutable for the life of the object.
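A minimal sketch of the behavior described above, using only the standard library (the class name is illustrative, not the scalems implementation):

```python
import uuid


class EphemeralId:
    """Process-scoped, UUID-based identifier (illustrative sketch)."""

    def __init__(self, node=None, clock_seq=None):
        # uuid1 incorporates host and clock data: unique within a process
        # scope, but not reproducible between runs.
        self._uuid = uuid.uuid1(node=node, clock_seq=clock_seq)

    def bytes(self):
        # Immutable for the life of the object: derived from a fixed UUID.
        return self._uuid.bytes

    def __str__(self):
        return str(self._uuid)


a = EphemeralId()
assert a.bytes() == a.bytes()  # consistent representation
assert len(a.bytes()) == 16    # 128-bit UUID
```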

class scalems.identifiers.FingerprintHash

The fingerprint hash is a 32-byte sequence containing a SHA256 digest.

alias of bytes

class scalems.identifiers.Identifier[source]

SCALE-MS object identifiers support this protocol.

Identifiers may be implemented in terms of a hashing scheme, RFC 4122 UUID, or other encoding appropriate for the scope of claimed uniqueness and reusability (cacheable).

Namespace UUIDs are appropriate for strongly specified names, such as operation implementation identifiers. The 48 bits of data are sufficient to identify graph nodes at session scope. At workflow scope, we need additional semantics about what should persist or not.

Concrete data warrants a 128-bit or 256-bit checksum.

Design notes

Several qualifications may be necessary regarding the use of an identifier, which may warrant annotation accessible through the Identifier instance. Additional proposed data members include:

  • scope: Scope in which the Identifier is effective and unique.

  • reproducible: Whether results will have the same identity if re-executed. Relates to behaviors in situations such as missing cache data.

  • concrete: Is this a concrete object or something more abstract?

__init__(*args, **kwargs)
abstract bytes()[source]

A consistent bytes representation of the identity.

The core interface provided by Identifiers.

Note that the identity (and the value returned by self.bytes()) must be immutable for the life of the object.

Return type:

bytes

encode()[source]

Get a canonical encoding of the identifier as a native Python object.

This is the method that will be used to produce serialized workflow records.

By default, the string representation (self.__str__()) is used. Subclasses may override, as long as suitable decoding is possible and provided.

Return type:

dict | list | tuple | str | int | float | bool | None

concrete: bool

Is this a concrete object or something more abstract?
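The interface above can be sketched structurally with typing.Protocol. The names here are hypothetical stand-ins for illustration; the real interface is scalems.identifiers.Identifier:

```python
import hashlib
from typing import Protocol, runtime_checkable


@runtime_checkable
class IdentifierProtocol(Protocol):
    """Structural type for objects usable as identifiers."""

    def bytes(self) -> bytes:
        """A consistent bytes representation of the identity."""
        ...

    def encode(self):
        """Canonical encoding as a native Python object (str by default)."""
        ...


class ChecksumId:
    """Identifier for concrete data, backed by a SHA256 digest (sketch)."""

    def __init__(self, data: bytes):
        self._digest = hashlib.sha256(data).digest()

    def bytes(self) -> bytes:
        return self._digest

    def encode(self) -> str:
        # String representation is the default canonical encoding.
        return self._digest.hex()


ident = ChecksumId(b"some concrete data")
assert isinstance(ident, IdentifierProtocol)
assert len(ident.bytes()) == 32  # 256-bit checksum for concrete data
```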

class scalems.identifiers.NamedIdentifier[source]

A name with strong identity semantics, represented with a UUID.

__init__(nested_name)[source]
Parameters:

nested_name (Sequence[str]) –

bytes()[source]

A consistent bytes representation of the identity.

The core interface provided by Identifiers.

Note that the identity (and the value returned by self.bytes()) must be immutable for the life of the object.

encode()[source]

Get a canonical encoding of the identifier as a native Python object.

This is the method that will be used to produce serialized workflow records.

By default, the string representation (self.__str__()) is used. Subclasses may override, as long as suitable decoding is possible and provided.

Return type:

dict | list | tuple | str | int | float | bool | None
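A strongly specified nested name can be hashed reproducibly with an RFC 4122 version-5 (namespace) UUID. A sketch; the namespace constant here is illustrative, not the actual scalems namespace:

```python
import uuid

# Hypothetical root namespace; scalems defines its own.
_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "scalems.org")


class NamedId:
    """A name with strong identity semantics, represented with a UUID (sketch)."""

    def __init__(self, nested_name):
        self._name = tuple(str(part) for part in nested_name)
        # Deterministic: derived entirely from the namespace and the name.
        self._uuid = uuid.uuid5(_NAMESPACE, ".".join(self._name))

    def bytes(self):
        return self._uuid.bytes

    def encode(self):
        return str(self._uuid)


# Reproducible: the same nested name always yields the same identity.
assert NamedId(("scalems", "executable")).bytes() == NamedId(("scalems", "executable")).bytes()
assert NamedId(("scalems", "executable")).bytes() != NamedId(("scalems", "subprocess")).bytes()
```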

class scalems.identifiers.OperationIdentifier[source]

Python structure to identify an API Operation implementation.

Operations are identified with a nested scope. The OperationIdentifier is a sequence of identifiers such that the operation_name() is the final element, and the preceding subsequence comprises the namespace().

Conventional string representation of the entire identifier uses a period (.) delimiter.
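The nested-scope convention can be illustrated with a small tuple subclass (names are illustrative, not the scalems implementation):

```python
class OperationId(tuple):
    """Sequence of identifiers: namespace parts, then the operation name."""

    def namespace(self):
        # All but the final element.
        return tuple(self[:-1])

    def operation_name(self):
        # The final element of the sequence.
        return self[-1]

    def __str__(self):
        # Conventional representation uses a period delimiter.
        return ".".join(self)


op = OperationId(("scalems", "subprocess", "executable"))
assert op.namespace() == ("scalems", "subprocess")
assert op.operation_name() == "executable"
assert str(op) == "scalems.subprocess.executable"
```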

class scalems.identifiers.ResourceIdentifier[source]
__init__(fingerprint)[source]
Parameters:

fingerprint (bytes) –

bytes()[source]

A consistent bytes representation of the identity.

The core interface provided by Identifiers.

Note that the identity (and the value returned by self.bytes()) must be immutable for the life of the object.
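Because managed files are keyed by their resource identifier (see FilesView, below), a fingerprint wrapper needs value semantics so it can serve as a mapping key. A sketch with illustrative names:

```python
import hashlib


class ResourceId:
    """Immutable wrapper for a 32-byte fingerprint, usable as a mapping key."""

    def __init__(self, fingerprint: bytes):
        if len(fingerprint) != 32:
            raise ValueError("expected a SHA256 digest (32 bytes)")
        self._fingerprint = bytes(fingerprint)

    def bytes(self) -> bytes:
        return self._fingerprint

    def __eq__(self, other):
        return isinstance(other, ResourceId) and other._fingerprint == self._fingerprint

    def __hash__(self):
        return hash(self._fingerprint)

    def __str__(self):
        # Hexadecimal encoding, as used for serialized metadata keys.
        return self._fingerprint.hex()


digest = hashlib.sha256(b"file contents").digest()
files = {ResourceId(digest): "/tmp/example.dat"}
# Two wrappers around the same digest identify the same resource.
assert files[ResourceId(digest)] == "/tmp/example.dat"
```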

class scalems.identifiers.TypeDataDescriptor[source]

Implement the dtype attribute.

The TypeDataDescriptor object is instantiated to implement the BasicSerializable.base_type dynamic attribute.

name

Name of the attribute provided by the data descriptor.

base

TypeIdentifier associated with the Python class.

Type:

MutableMapping[type, scalems.identifiers.TypeIdentifier]

name can be provided at initialization, but is overridden during class definition when TypeDataDescriptor is used in the usual way (as a data descriptor instantiated during class definition).

At least for now, name is required to be _dtype.

attr_name is derived from name at access time. For now, it is always __dtype.

Instances of the Python class may have their own dtype. For the SCALE-MS data model, TypeIdentifier is an instance attribute rather than a class attribute. If an instance did not set self.__dtype at initialization, the descriptor returns base for the instance’s class.

base is the (default) SCALEMS TypeIdentifier for the class using the descriptor. For a class using the data descriptor, base is inferred from the class __module__ and __qualname__ attributes, if not provided through the class definition.

A single data descriptor instance is used for a class hierarchy to encapsulate the meta-programming for UnboundObject classes without invoking Python metaclass arcana (so far). At module import, a TypeDataDescriptor is instantiated for BasicSerializable._dtype. The data descriptor instance keeps a weakref.WeakKeyDictionary mapping type objects (classes) to the TypeDataDescriptor details for classes other than BasicSerializable. (BasicSerializable._dtype always produces TypeIdentifier(('scalems', 'BasicSerializable')).) The mapping is updated whenever BasicSerializable is subclassed.
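The descriptor scheme described above can be sketched as follows. All names here are illustrative; the real implementation lives in scalems.identifiers:

```python
import weakref


class TypeDescriptor:
    """Data descriptor providing a per-class default type identifier."""

    def __init__(self, name="_dtype"):
        self.name = name
        # Weak mapping: registered classes can still be garbage collected.
        self.base = weakref.WeakKeyDictionary()

    @property
    def attr_name(self):
        # Name of the instance data member used for storage.
        return "__" + self.name.lstrip("_")

    def __set_name__(self, owner, name):
        # Overrides any name provided at initialization.
        self.name = name
        self.base[owner] = (owner.__module__, owner.__qualname__)

    def __get__(self, instance, owner):
        if instance is not None:
            # Instances may carry their own dtype.
            override = getattr(instance, self.attr_name, None)
            if override is not None:
                return override
        # Fall back to the identifier registered for the class.
        return self.base[owner]


class Serializable:
    _dtype = TypeDescriptor()

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Keep the single shared descriptor up to date as subclasses appear.
        vars(Serializable)["_dtype"].base[cls] = (cls.__module__, cls.__qualname__)


class Derived(Serializable):
    pass


assert Serializable._dtype[-1] == "Serializable"
assert Derived()._dtype[-1] == "Derived"
```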

__init__(name=None, base_type=None)[source]
property attr_name

Name of the instance data member used by this descriptor for storage.

class scalems.identifiers.TypeIdentifier[source]
classmethod copy_from(typeid)[source]

Create a new TypeIdentifier instance describing the same type as the source.

Return type:

TypeIdentifier

Serialization

Provide encoding and decoding support for serialized workflow representations.

Reference https://docs.python.org/3/library/json.html#json.JSONDecoder and https://docs.python.org/3/library/json.html#py-to-json-table describe the trivial Python object conversions. The core SCALE-MS encoder / decoder needs to manage the conversion of additional types (scalems or otherwise, e.g. bytes) to/from these basic Python types.

For JSON, we can provide an encoder for the cls parameter of json.dumps() and we can provide a key-value pair processing dispatcher to the object_pairs_hook parameter of json.loads()

We may also implement serialization schemes for document formats other than JSON, such as the CWL schema.
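For example, bytes values can be round-tripped through JSON with an encoder passed as cls to json.dumps() and a pair-processing hook passed as object_pairs_hook to json.loads(). The tagging scheme below is illustrative, not the scalems wire format:

```python
import json


class BytesEncoder(json.JSONEncoder):
    """Encode bytes as a tagged hex string; defer other types to the base class."""

    def default(self, o):
        if isinstance(o, bytes):
            return {"!bytes": o.hex()}
        return super().default(o)


def decode_pairs(pairs):
    """object_pairs_hook: reconstruct tagged values, else build a plain dict."""
    mapping = dict(pairs)
    if set(mapping) == {"!bytes"}:
        return bytes.fromhex(mapping["!bytes"])
    return mapping


record = {"fingerprint": b"\x00\x01\xff", "label": "input"}
text = json.dumps(record, cls=BytesEncoder)
restored = json.loads(text, object_pairs_hook=decode_pairs)
assert restored == record
```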

class scalems.serialization.BasicSerializable[source]
__init__(data, *, dtype, shape=(1,), label=None, identity=None)[source]
class scalems.serialization.Shape[source]

Describe the data shape of a SCALEMS object.

__init__(elements)[source]

Initial implementation requires a sequence of integers.

Software requirements include symbolic elements, TBD.

Parameters:

elements (Iterable) –

static __new__(cls, elements)[source]
Parameters:

elements (Iterable) –
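Shape can be sketched as an immutable tuple of integers (symbolic elements remain a future requirement; this sketch is not the scalems implementation):

```python
class Shape(tuple):
    """Describe the data shape of an object as a tuple of integers (sketch)."""

    def __new__(cls, elements):
        elements = tuple(elements)
        # Initial implementation requires a sequence of integers.
        if not all(isinstance(n, int) for n in elements):
            raise TypeError("Shape requires a sequence of integers.")
        return super().__new__(cls, elements)


assert Shape((1,)) == (1,)   # default scalar shape
assert Shape([3, 2]) == (3, 2)
```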

Filesystem and data locality context

Protocols

Provide an interface for managing filesystem objects in a workflow.

We distinguish between File and FileReference types.

A File is a concrete representation of a local filesystem object.

A FileReference is more general. It may represent a non-local object (accessible by a URI). We currently use FileReference objects exclusively for resources managed by scalems. (See scalems.store.FileStore.)

This module has minimal dependencies on other scalems implementation details so that it can be included easily by scalems modules without circular dependencies.

class scalems.file.AbstractFile[source]

Abstract base for file types.

Extends os.PathLike with the promise of a fingerprint() method and some support for access control semantics.

See describe_file() for a creation function.

__init__(*args, **kwargs)
abstract fingerprint()[source]

Get a checksum or other suitable fingerprint for the file data.

Note that the fingerprint could change if the file is written to during the lifetime of the handle. This is too complicated to optimize, generally, so possible caching of fingerprint results is left to the caller.

Return type:

bytes

access: AccessFlags

Declared access requirements.

Not yet enforced or widely used.
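A minimal concrete file type satisfying the contract above might compute its fingerprint as a SHA256 digest over the file contents. A stdlib-only sketch (class name is illustrative):

```python
import hashlib
import os
import tempfile


class LocalFile:
    """os.PathLike file handle with a content fingerprint (illustrative)."""

    def __init__(self, path):
        self._path = os.fspath(path)

    def __fspath__(self):
        return self._path

    def fingerprint(self) -> bytes:
        # Stream in chunks. The result can change if the file is rewritten
        # during the lifetime of the handle, so caching is left to the caller.
        digest = hashlib.sha256()
        with open(self._path, "rb") as fh:
            for chunk in iter(lambda: fh.read(65536), b""):
                digest.update(chunk)
        return digest.digest()


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example data")
f = LocalFile(tmp.name)
assert f.fingerprint() == hashlib.sha256(b"example data").digest()
os.unlink(tmp.name)
```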

class scalems.file.AbstractFileReference[source]

Reference to a managed file.

A FileReference is an os.PathLike object.

If filereference.is_local() is True, os.fspath() will return a str for the referenced file’s path in the local filesystem. If the file is not available locally, an exception will be raised.

Example

handle: scalems.file.AbstractFileReference = await filestore.add_file(…)

__init__(*args, **kwargs)
as_uri(context=None)[source]

Get an appropriate URI for a referenced file, whether or not it is available.

For local file stores, the result is the same as for obj.path().as_uri() except that no exception is raised if the object does not yet exist locally.

When context is specified, the returned URI may not be resolvable without additional contextual facilities. For instance, scalems.radical contexts may return a radical.saga.Url using one of the scheme extensions for RADICAL Pilot staging directives.

Return type:

str

abstract filestore()[source]

Get a handle to the managing filestore for this reference.

abstract is_local(context=None)[source]

Check for file existence in a given context.

By default, file existence is checked in the currently active FileStore. If context is specified, checks whether the referenced file is locally available in the given context.

Return type:

bool

abstract key()[source]

Get the identifying key for the file reference.

The return value may not be human-readable, but may be necessary for locating the referenced file through the filestore.

Return type:

ResourceIdentifier

abstract async localize(context=None)[source]

Make sure that the referenced file is available locally for a given context.

With no arguments, localize() affects the currently active FileStore. Specific (e.g. remote) data stores can be addressed by providing context.

If the file is not yet localized, yields until the necessary data transfers have completed.

Raises:

TBD

Return type:

AbstractFileReference

class scalems.file.AccessFlags[source]
class scalems.file.BaseBinary[source]

Base class for binary file types.

__init__(path, access=AccessFlags.READ)[source]
fingerprint()[source]

Get a checksum or other suitable fingerprint for the file data.

Note that the fingerprint could change if the file is written to during the lifetime of the handle. This is too complicated to optimize, generally, so possible caching of fingerprint results is left to the caller.

Return type:

bytes

class scalems.file.BaseText[source]

Base class for text file types.

__init__(path, access=AccessFlags.READ, encoding=None)[source]
Parameters:

access (AccessFlags) –

fingerprint()[source]

Get a checksum or other suitable fingerprint for the file data.

Note that the fingerprint could change if the file is written to during the lifetime of the handle. This is too complicated to optimize, generally, so possible caching of fingerprint results is left to the caller.

Return type:

bytes

scalems.file.describe_file(obj, mode='rb', encoding=None, file_type_hint=None)[source]

Describe an existing local file.

Return type:

AbstractFile

Data store

Manage non-volatile data for SCALE-MS in a filesystem context.

Implement several interfaces from scalems.file.

The data store must be active before attempting any manipulation of the managed workflow. The data store must be closed by the time control returns to the interpreter after leaving a managed workflow scope (and releasing a WorkflowManager instance).

We prefer to do this with the Python context manager protocol, since we already rely on scalems.workflow.scope().

We can also use a contextvars.ContextVar to hold a weakref to the FileStore, and register a finalizer to perform a check, but the check could come late in the interpreter shutdown and we should not rely on it. Also, note the sequence with which module variables and class definitions are released during shutdown.

exception scalems.store.StaleFileStore[source]

The backing file store is not consistent with the known (local) history.

class scalems.store.FileReference[source]

Provide access to a File managed by a FileStore.

Implements AbstractFileReference.

__init__(filestore, key)[source]
as_uri(context=None)[source]

Get an appropriate URI for a referenced file, whether or not it is available.

For local file stores, the result is the same as for obj.path().as_uri() except that no exception is raised if the object does not yet exist locally.

When context is specified, the returned URI may not be resolvable without additional contextual facilities. For instance, scalems.radical contexts may return a radical.saga.Url using one of the scheme extensions for RADICAL Pilot staging directives.

Return type:

str

filestore()[source]

Get a handle to the managing filestore for this reference.

Return type:

FileStore

is_local(context=None)[source]

Check for file existence in a given context.

By default, file existence is checked in the currently active FileStore. If context is specified, checks whether the referenced file is locally available in the given context.

Return type:

bool

key()[source]

Get the identifying key for the file reference.

The return value may not be human-readable, but may be necessary for locating the referenced file through the filestore.

Return type:

ResourceIdentifier

async localize(context=None)[source]

Make sure that the referenced file is available locally for a given context.

With no arguments, localize() affects the currently active FileStore. Specific (e.g. remote) data stores can be addressed by providing context.

If the file is not yet localized, yields until the necessary data transfers have completed.

Raises:

TBD

Return type:

FileReference

class scalems.store.FileStore[source]

Handle to the SCALE-MS nonvolatile data store for a workflow context.

Used as a Container, serves as a Mapping from identifiers to resource references.

TODO: Consider generalizing the value type of the mapping for general Resource types, per the scalems data model.

__init__(*, directory)[source]

Assemble the data structure.

Users should not create FileStore objects directly, but with initialize_datastore() or through the WorkflowManager instance.

Once initialized, caller is responsible for calling the close() method. The easiest way to do this is to avoid creating the FileStore directly, and instead use a FileStoreManager object.

No directory in the filesystem should be managed by more than one FileStore. The FileStore class maintains a registry of instances to prevent instantiation of a new FileStore for a directory that is already managed.

Raises:

ContextError – if attempting to instantiate for a directory that is already managed.

Parameters:

directory (Path) –

async add_file(obj, _name=None)[source]

Add a file to the file store.

Not thread safe. User is responsible for serializing access, as necessary.

We require file paths to be wrapped in a special type so that we can enforce that some error checking is possible before the coroutine actually runs.

This involves placing (copying) the file, reading the file to fingerprint it, and then writing metadata for the file. For clarity, we also rename the file after fingerprinting to remove a layer of indirection, even though this generates additional load on the filesystem.

The caller is allowed to provide a _name to use for the filename, but this is strongly discouraged. The file will still be fingerprinted, and the caller must be sure the key is unique across all software components that may be using the FileStore. (This use case should be limited to internal use or core functionality to minimize namespace collisions.)

If the file is provided as a memory buffer or io.IOBase subclass, further optimization is possible to reduce the filesystem interaction to a single buffered write, but such optimizations are not provided natively by FileStore.add_file(). Instead, such optimizations may be provided by utility functions that produce AbstractFileReference objects that FileStore.add_file() can consume.

Raises:

In addition to TypeError and ValueError for invalid inputs, propagates exceptions raised by failed attempts to access the provided file object.

Return type:

FileReference

add_task(identity, **kwargs)[source]

Add a new task entry to the metadata store.

Not thread-safe. (We may need to manage this in a thread-safe manner, but it is probably preferable to do that by delegating a single-thread or thread-safe manager to serialize metadata edits.)

Parameters:

identity (Identifier) –

close()[source]

Flush, shut down, and disconnect the FileStore from the managed directory.

Not robustly thread safe. User is responsible for serializing access, as necessary, though ideally close() should be called exactly once, after the instance is no longer in use.

Raises:
  • StaleFileStore – if called on an invalid or outdated handle.

  • ScopeError – if called from a disallowed context, such as from a forked process.

flush()[source]

Write the current metadata to the backing store, if there are pending updates.

For atomic updates and better forensics, we write to a new file and move the file to replace the previous metadata file. If an error occurs, the temporary file will linger.

Changes to metadata must be made with the update lock held and must end by setting the “dirty” condition (notifying the watchers of “dirty”). flush() should acquire the update lock and clear the “dirty” condition when successful. We can start with “dirty” as an Event. If we need more elaborate logic, we can use a Condition. It should be sufficient to use asyncio primitives, but we may need to use primitives from the threading module to allow metadata updates directly from RP task callbacks.
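The write-then-replace pattern described above can be sketched with os.replace(), which atomically replaces the destination file (function and file names are illustrative):

```python
import json
import os
import tempfile


def flush_metadata(metadata: dict, target: str):
    """Atomically replace the metadata file; on error, the temp file lingers."""
    directory = os.path.dirname(os.path.abspath(target))
    # Write the new contents to a temporary file in the same directory.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(metadata, fh)
    # Atomic replacement: readers see either the old file or the new one.
    os.replace(tmp_path, target)


with tempfile.TemporaryDirectory() as tmpdir:
    target = os.path.join(tmpdir, "metadata.json")
    flush_metadata({"instance": 1, "files": {}}, target)
    with open(target) as fh:
        assert json.load(fh) == {"instance": 1, "files": {}}
```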

property datastore: Path

Path to the data store for the workflow managed at directory.

scalems creates a subdirectory at the root of a workflow in which to manage internal data. Editing files or directories in this subdirectory will affect workflow state and validity.

The name reflects the SCALE-MS data format version and is not user-configurable.

property directory: Path

The work directory under management.

Generally, this is the current working directory when a scalems script is launched. The user is free to use the work directory to stage input and output files with the exception of the single scalems datastore directory. See FileStore.datastore.

scalems will create a hidden “lock” directory briefly when starting up or shutting down. You may need to manually remove the lock directory after a particularly bad crash.

property filepath: Path

Path to the metadata backing store.

This is the metadata file used by scalems to track workflow state and file-backed data. Its format is closely tied to the scalems API level.

property files: Mapping[ResourceIdentifier, Path]

Proxy to the managed files metadata.

The proxy is read-only, but dynamic. A reference to FileStore.files will provide a dynamic view for later use.

Warning

The returned FilesView will extend the lifetime of the metadata structure, but will not extend the lifetime of the FileStore itself.

property log

Get access to the operation log.

Interface TBD…

class scalems.store.FilesView[source]

Read-only viewer for files metadata.

Provides a mapping from ResourceIdentifier to the stored filesystem path.

Warning

The Path objects represent paths in the context of the filesystem of the FileStore. If the FileStore does not represent a local filesystem, then a file:// URI obtained from value.as_uri() is specific to the remote filesystem.

__init__(files)[source]
Parameters:

files (Mapping[str, str]) –

class scalems.store.Metadata[source]

Simple container for metadata at run time.

This is intentionally decoupled from the non-volatile backing store or any usage protocols. Instances are managed through the FileStore.

We will probably prefer to provide distinct interfaces for managing non-volatile assets (files) and for persistent workflow state (metadata), which may ultimately be supported by a single FileStore. That would likely involve replacing rather than extending this class, so let’s keep it as a simple container for now: Make sure that this dataclass remains easily serialized and deserialized.

For simplicity, this data structure should be easily serializable. Where necessary, mappings to and from non-trivial types must be documented.

__init__(instance, files=<factory>, objects=<factory>)
Return type:

None

files: MutableMapping[str, str]

Managed filesystem objects.

Maps hexadecimal-encoded (str) ResourceIdentifiers to filesystem paths (str).

instance: int

Unique identifier for this data store.

objects: MutableMapping[str, dict]

Dictionary-encoded objects.

Refer to scalems.serialization for encoding schema.

scalems.store.filestore_generator(directory=None)[source]

Generator function to manage a FileStore instance.

Initializes a FileStore and yields a weak reference to it each time the generator is advanced. When the generator is garbage collected, or otherwise interrupted, it properly closes the FileStore before finishing.

This is a simple “coroutine” using the extended generator functionality from PEP 342. See also https://docs.python.org/3/howto/functional.html#passing-values-into-a-generator
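The generator-based lifetime management can be sketched as follows (DummyStore stands in for FileStore; names are illustrative):

```python
import weakref


class DummyStore:
    """Stand-in for FileStore."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def store_generator():
    """Yield weak references to a managed store; close it when interrupted."""
    store = DummyStore()
    try:
        while True:
            # Callers receive only a weak reference to the managed resource.
            yield weakref.ref(store)
    finally:
        # Runs when the generator is closed or garbage collected.
        store.close()


gen = store_generator()
store = next(gen)()
assert not store.closed
gen.close()  # triggers the finally block
assert store.closed
```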

scalems.store.get_file_reference(obj, filestore=None, mode='rb')[source]
scalems.store.get_file_reference(obj, *args, **kwargs)

Get a FileReference for the provided object.

If filestore is provided, use the given FileStore to manage the FileReference. Otherwise, use the FileStore for the current WorkflowManager.

This is a dispatching function. Handlers for particular object types are registered by decorating them with @get_file_reference.register. See functools.singledispatch().

Parameters:
  • obj – identify the (managed) file

  • filestore – file resource management backing store

  • mode (str) – file access mode

Return type:

Coroutine[None, None, FileReference]

If obj is a Path or PathLike object, add the identified file to the file store. If obj is a ResourceIdentifier, attempts to retrieve a reference to the identified filesystem object.

Note

If obj is a str or bytes, obj is interpreted as a ResourceIdentifier, not as a filesystem path.

TODO: Either try to disambiguate str and bytes, or disallow and require stronger typed inputs.

This involves placing (copying) the file, reading the file to fingerprint it, and then writing metadata for the file. For clarity, we also rename the file after fingerprinting to remove a layer of indirection, even though this generates additional load on the filesystem.

In addition to TypeError and ValueError for invalid inputs, propagates exceptions raised by failed attempts to access the provided file object.

TODO: Try to detect file type. See, for instance, https://pypi.org/project/python-magic/
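The dispatching scheme can be sketched with functools.singledispatch. The handler bodies and return values here are illustrative placeholders, not the scalems behavior:

```python
import functools
import pathlib


@functools.singledispatch
def get_reference(obj, *args, **kwargs):
    raise TypeError(f"No handler registered for {type(obj)}")


@get_reference.register
def _(obj: pathlib.PurePath, filestore=None, mode="rb"):
    # A Path-like object identifies a file to add to the store.
    return ("add_file", str(obj), mode)


@get_reference.register
def _(obj: bytes, filestore=None, mode="rb"):
    # bytes are interpreted as a resource identifier, not a filesystem path.
    return ("lookup", obj.hex())


assert get_reference(pathlib.Path("input.dat")) == ("add_file", "input.dat", "rb")
assert get_reference(b"\x01\x02") == ("lookup", "0102")
```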