Upload data¶

Data is uploaded through the aind-data-transfer-service (docs), which runs containerized tasks for data copying, compression, metadata gathering, and the final upload to S3 and Code Ocean.

Job types and upload scripts¶

In general, most users should interact with the transfer service either by requesting data upload via watchdog (contact SIPE for setup) or by calling the aind-data-transfer-service REST API from an upload script. Users control which tasks are run on their data through job types and the parameters that they include in their upload scripts.

For example, this upload script demonstrates how to set up the upload parameters for a standard ecephys data asset using the "default" job_type. You can view all available job_type options. Please reach out to the Data & Infrastructure team in Scientific Computing to develop custom job types for your data assets.

Job types define convenient defaults, but you are not locked into them: any parameter set by the job type can be overridden in your upload script. For example, to pin the metadata mapper to a specific version rather than using the job type’s default:

from aind_data_transfer_service.models.core import Task, UploadJobConfigsV2

gather_preliminary_metadata = Task(
    image_version="v1.1.0",  # overrides the job_type default
    job_settings={"metadata_dir": "/path/to/your/data"},
)

upload_job_configs = UploadJobConfigsV2(
    job_type="vr_foraging_fiber",  # all other defaults still come from the job_type
    ...
    tasks={"gather_preliminary_metadata": gather_preliminary_metadata},
)

Available mapper versions are listed here. For a complete reference of all parameters you can control, see this example script.

GatherMetadataJob¶

The GatherMetadataJob is the primary tool used to assemble and validate metadata during upload of data assets. The job constructs the data_description, subject, and procedures metadata, and it merges and validates instrument and acquisition metadata. It also runs a full validation step on all available metadata files to ensure cross-compatibility.

The main settings you should be concerned with are:

instrument_settings.instrument_id: this field triggers the job to pull an instrument.json file from the metadata-service (where you previously uploaded it).
data_description_settings.tags/group/restrictions/data_summary: each of these fields is meta-metadata about your project and should be accurately filled out, if possible. Please see the DataDescription documentation for details about each field.

The settings for the GatherMetadataJob are typically set inside of your upload script or as part of the job_type.
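As a rough sketch of how the settings named above nest, the dictionary below mirrors the dotted paths (`instrument_settings.instrument_id`, `data_description_settings.tags`, and so on). All values are hypothetical placeholders, and the exact layout expected by the GatherMetadataJob is an assumption here; check the example upload script for the authoritative structure.

```python
import json

# Sketch only: key names follow the dotted setting paths mentioned above,
# and every value is a hypothetical placeholder.
gather_metadata_settings = {
    "instrument_settings": {
        # Triggers the job to pull instrument.json from the metadata-service.
        "instrument_id": "my_rig_001",  # hypothetical instrument ID
    },
    "data_description_settings": {
        "tags": ["example-tag"],  # hypothetical
        "group": "example-group",  # hypothetical
        "restrictions": None,  # hypothetical
        "data_summary": "Short description of the data asset.",  # hypothetical
    },
}

# Print the nested structure to inspect it before adding it to an upload script.
print(json.dumps(gather_metadata_settings, indent=2))
```

A dictionary like this would presumably be supplied as the `job_settings` of the relevant task in your upload script, in the same way the override example earlier passes `job_settings` to a `Task`.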

Merge rules¶

When can multiple files be merged?¶

When data is acquired simultaneously using two or more distinct instruments (e.g., a behavior instrument and a physiology instrument), multiple instrument.json and/or acquisition.json metadata files can be provided. The GatherMetadataJob will merge these files during upload via the aind-data-transfer-service.

File Naming Convention¶

Each file must follow the naming pattern <metadata_type>*.json where * is any string. We recommend using modalities to organize the individual files:

instrument_behavior.json and instrument_ecephys.json
acquisition_behavior.json and acquisition_ecephys.json
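To illustrate the naming pattern, the following sketch groups file names by metadata type using shell-style matching. The helper `group_metadata_files` and the example file list are hypothetical, and this is not necessarily how the GatherMetadataJob discovers files internally.

```python
from fnmatch import fnmatchcase

# Metadata types that may be split across several files, per the text above.
MERGEABLE_TYPES = ("instrument", "acquisition")


def group_metadata_files(file_names):
    """Group file names by metadata type using the <metadata_type>*.json pattern."""
    groups = {metadata_type: [] for metadata_type in MERGEABLE_TYPES}
    for name in sorted(file_names):
        for metadata_type in MERGEABLE_TYPES:
            # fnmatchcase keeps the match case-sensitive on every platform.
            if fnmatchcase(name, f"{metadata_type}*.json"):
                groups[metadata_type].append(name)
    return groups


# Hypothetical directory listing for a combined behavior + ecephys session.
example_files = [
    "instrument_behavior.json",
    "instrument_ecephys.json",
    "acquisition_behavior.json",
    "acquisition_ecephys.json",
    "subject.json",
]
grouped = group_metadata_files(example_files)
print(grouped)
```

Files such as `subject.json` match neither pattern and are left out of both groups.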

Constraints¶

Unique fields must match: Certain identifier fields that should be unique across the dataset (like subject_id) must have identical values in all files being merged. If these fields conflict, the merge will fail and your upload job will be rejected. An important exception is the instrument_id field: if two or more instrument JSON files are merged, the merged instrument JSON file will have an instrument_id that is the string combination of the IDs of the individual instruments.
No shared devices, with the exception of a single shared clock: In general, two instruments can be merged if and only if there are no shared devices between them. Devices are identified by their name field. If the same device name appears in both instrument files, they should really be defined as a single instrument, not two separate ones.

Exception for clock synchronization: When synchronizing data acquisition across multiple instruments (e.g., recording behavior and physiology simultaneously), a shared clock device is permitted. For AIND instruments this must be a HarpDevice configured as a clock generator (HarpDevice.is_clock_generator=True).
Enable validation: It is strongly recommended to turn on the raise_if_invalid setting in the GatherMetadataJob job settings. This validates that the merge will succeed before upload, making it much easier to identify and fix problems compared to dealing with a raw data asset with broken metadata.
Python merging: You can test merging locally in Python using the + operator:

from aind_data_schema.core.instrument import Instrument
from aind_data_schema.core.acquisition import Acquisition

# instrument_json_1, instrument_json_2, acquisition_json_1, and acquisition_json_2
# are placeholders for the contents of your metadata files, loaded as JSON strings.

# Merge instruments
instrument1 = Instrument.model_validate_json(instrument_json_1)
instrument2 = Instrument.model_validate_json(instrument_json_2)
merged_instrument = instrument1 + instrument2

# Merge acquisitions
acquisition1 = Acquisition.model_validate_json(acquisition_json_1)
acquisition2 = Acquisition.model_validate_json(acquisition_json_2)
merged_acquisition = acquisition1 + acquisition2
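Before attempting a merge, the no-shared-devices constraint described above can be approximated on plain dictionaries. The sketch below is not the library's merge logic: it assumes each instrument lists its devices under a `components` key, and the helper `check_mergeable` and the example instruments are hypothetical. Only the `name` and `is_clock_generator` fields come from the text above.

```python
def device_names(instrument):
    """Collect device names, assuming devices are listed under a "components" key."""
    return {device["name"] for device in instrument.get("components", [])}


def clock_generator_names(instrument):
    """Collect names of devices flagged with is_clock_generator=True."""
    return {
        device["name"]
        for device in instrument.get("components", [])
        if device.get("is_clock_generator", False)
    }


def check_mergeable(instrument_a, instrument_b):
    """Return a list of problems; an empty list means no constraint was violated."""
    shared = device_names(instrument_a) & device_names(instrument_b)
    shared_clocks = clock_generator_names(instrument_a) & clock_generator_names(instrument_b)
    problems = [f"shared device: {name}" for name in sorted(shared - shared_clocks)]
    if len(shared_clocks) > 1:
        # The constraint allows only a single shared clock.
        problems.append("more than one shared clock: " + ", ".join(sorted(shared_clocks)))
    return problems


# Hypothetical instruments that share only a clock generator.
behavior = {"components": [{"name": "Harp clock", "is_clock_generator": True}, {"name": "Lick sensor"}]}
ephys = {"components": [{"name": "Harp clock", "is_clock_generator": True}, {"name": "Probe A"}]}
# Hypothetical instrument that reuses a non-clock device name from `behavior`.
conflicting = {"components": [{"name": "Lick sensor"}, {"name": "Probe B"}]}

print(check_mergeable(behavior, ephys))
print(check_mergeable(behavior, conflicting))
```

A check like this only catches name collisions early; the `+` operator and the raise_if_invalid setting remain the authoritative tests of whether a merge succeeds.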

Implementation details¶

The exact merge logic for each metadata type is defined in the __add__ methods in the aind-data-schema repository. See the following files:

src/aind_data_schema/core/instrument.py
src/aind_data_schema/core/acquisition.py
tests/test_composability_merge.py (for test examples)