Reference

proksee_batch

Proksee Batch.

proksee_batch.__main__

Main module for the Proksee Batch tool.

This file defines the proksee-batch command-line interface and implements the high-level logic of the tool.

proksee_batch.__main__.generate_js_data(output_dir, genome_info, run_date, input_dir)

Generates a JavaScript file with genome information wrapped in a variable assignment.

Parameters:
  • output_dir (str)

  • genome_info (List[Dict[str, Any]])

  • run_date (str)

  • input_dir (str)

Return type:

None
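The phrase "wrapped in a variable assignment" can be illustrated with a minimal sketch: serialize the data to JSON and write it as a JavaScript assignment so a browser can load it with a plain script tag. The helper name write_js_variable and the variable name genomeData are hypothetical; the layout that generate_js_data actually writes is not documented here.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def write_js_variable(path: str, var_name: str, data: List[Dict[str, Any]]) -> None:
    """Write `data` as `var <var_name> = <json>;` (hypothetical helper)."""
    # JSON is valid JavaScript literal syntax, so it can be assigned directly.
    payload = json.dumps(data, indent=2)
    Path(path).write_text(f"var {var_name} = {payload};\n", encoding="utf-8")
```

Loading such a file from an HTML page makes the data available as a global variable without any HTTP request for JSON, which is convenient for reports opened straight from disk.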

proksee_batch.__main__.get_description_from_metadata(metadata_path)

Extracts the accession and description from a metadata file.

Parameters:

metadata_path (str) – Path to the metadata file.

Returns:

The description extracted from the metadata file.

Return type:

str

proksee_batch.__main__.handle_error_exit(error_message, exit_code=1)

Handles errors by printing a message to sys.stderr and exiting the program.

Parameters:
  • error_message (str) – The error message to be printed.

  • exit_code (int) – The exit code to be used for sys.exit. Defaults to 1.

Return type:

None

Exits:

SystemExit: Exits the program with the provided exit code.
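A minimal sketch of the behaviour described above, using only the standard library. The name exit_with_error is hypothetical and is used here to avoid implying that this is the package's own code.

```python
import sys


def exit_with_error(error_message: str, exit_code: int = 1) -> None:
    """Print the message to stderr, then exit with the given code."""
    print(error_message, file=sys.stderr)
    # sys.exit raises SystemExit, which callers (and tests) can catch.
    sys.exit(exit_code)
```

Because sys.exit raises SystemExit rather than terminating immediately, a test can wrap the call in try/except SystemExit and inspect the code attribute of the exception.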

proksee_batch.download_example_input

Downloads GenBank files from the NCBI FTP site.

proksee_batch.download_example_input.download_example_genbank_files(genome_dir)

Downloads GenBank files from the NCBI FTP site.

Parameters:

genome_dir (str) – The path to the directory where the GenBank files should be saved.

Return type:

None

proksee_batch.download_example_input.download_example_input(output_dir)

Downloads example GenBank, BLAST, and BED files for several example bacterial genomes.

Parameters:

output_dir (str)

Return type:

None

proksee_batch.download_example_input.download_file(url, local_filename)

Downloads a file from a given URL and saves it to the local file system.

Parameters:
  • url (str) – The URL of the file to be downloaded.

  • local_filename (str) – The local path, including filename, where the file should be saved.

Return type:

None
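One way to implement a download of this shape with the standard library is sketched below. The name fetch_to_file is hypothetical, and the real download_file may use a different HTTP client or add retries; this only shows the streaming pattern.

```python
import shutil
import urllib.request


def fetch_to_file(url: str, local_filename: str) -> None:
    """Stream the response body at `url` into `local_filename`."""
    with urllib.request.urlopen(url) as response, open(local_filename, "wb") as out:
        # copyfileobj copies in chunks, so large files are not held in memory.
        shutil.copyfileobj(response, out)
```

urllib.request.urlopen also accepts file:// URLs, which makes the helper easy to exercise without network access.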

proksee_batch.download_example_input.download_genbank_file(output_dir, url)

Downloads a GenBank file from the NCBI FTP site.

Parameters:
  • output_dir (str) – The path to the directory where the GenBank files should be saved.

  • url (str) – The URL of the file to be downloaded.

Return type:

None

proksee_batch.generate_report_html

Code for generating an HTML report file with a table containing links to Proksee projects and images for each sample. A single genome viewer is positioned to the right of the table.

The output directory will be structured as in the following example:

output_directory/
    cgview-js_code/
    html_report_code/
        style.css
        table-functions.js
        viewer-functions.js
        utilities.js
    data/
        genome_name_1.js
        genome_name_2.js
        …
    report.html

proksee_batch.generate_report_html.generate_report_html(output_dir, genome_info)

Generates an HTML report file with a table containing links to Proksee projects and images for each sample. A single genome viewer is positioned to the right of the table.

Parameters:
  • output_dir (str)

  • genome_info (Dict[str, Any])

Return type:

None

proksee_batch.get_stats_from_seq_file

proksee_batch.get_stats_from_seq_file.get_stats_from_seq_file(seq_file, format)

Get basic stats from a GenBank or FASTA file.

Parameters:
  • seq_file (str)

  • format (str)

Return type:

Tuple[str, str, int, int, float]
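The meaning of the five tuple fields is not described here. As a rough illustration of "basic stats" for a FASTA input, the sketch below computes three commonly reported values. The function name fasta_stats and the choice of statistics are assumptions, not the package's actual return values.

```python
from typing import Tuple


def fasta_stats(fasta_text: str) -> Tuple[int, int, float]:
    """Return (total_length, contig_count, gc_fraction) for FASTA text."""
    contig_count = 0
    total_length = 0
    gc_count = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Each header line starts a new contig.
            contig_count += 1
            continue
        seq = line.upper()
        total_length += len(seq)
        gc_count += seq.count("G") + seq.count("C")
    gc_fraction = gc_count / total_length if total_length else 0.0
    return total_length, contig_count, gc_fraction
```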

proksee_batch.merge_cgview_json_with_template

proksee_batch.merge_cgview_json_with_template.merge_cgview_json_with_template(basic_json_file, template_file, output_file)

Merge a basic cgview map in JSON format with a Proksee configuration file in JSON format.

Parameters:
  • basic_json_file (str)

  • template_file (str)

  • output_file (str)

Return type:

None

proksee_batch.parse_additional_features

class proksee_batch.parse_additional_features.BedFeatureDict
class proksee_batch.parse_additional_features.BedMetaDict
class proksee_batch.parse_additional_features.BlastFeatureDict
class proksee_batch.parse_additional_features.BlastMetaDict
class proksee_batch.parse_additional_features.FeatureDecorationDict
class proksee_batch.parse_additional_features.GffFeatureDict
class proksee_batch.parse_additional_features.GffMetaDict
class proksee_batch.parse_additional_features.TrackDict
class proksee_batch.parse_additional_features.VcfFeatureDict
class proksee_batch.parse_additional_features.VcfMetaDict
proksee_batch.parse_additional_features.add_bed_features_and_tracks(bed_files, json_file, output_file)

Parses BED files, adds the parsed BED features and tracks to the cgview map JSON data structure, and writes the cgview map JSON data structure to a new file.

Parameters:
  • bed_files (List[str]) – A list of paths to BED files.

  • json_file (str) – The path to a cgview map JSON file.

  • output_file (str) – The path to the output file.

Return type:

None

proksee_batch.parse_additional_features.add_blast_features_and_tracks(blast_files, json_file, output_file)

Parses BLAST result files, adds the parsed BLAST features and tracks to the cgview map JSON data structure, and writes the cgview map JSON data structure to a new file.

Parameters:
  • blast_files (List[str]) – A list of paths to BLAST result files.

  • json_file (str) – The path to a cgview map JSON file.

  • output_file (str) – The path to the output file.

Return type:

None

proksee_batch.parse_additional_features.add_gff_features_and_tracks(gff_files, json_file, output_file)

Parses GFF files, adds the parsed GFF features and tracks to the cgview map JSON data structure, and writes the cgview map JSON data structure to a new file.

Parameters:
  • gff_files (List[str]) – A list of paths to GFF files.

  • json_file (str) – The path to a cgview map JSON file.

  • output_file (str) – The path to the output file.

Return type:

None

proksee_batch.parse_additional_features.add_vcf_features_and_tracks(vcf_files, json_file, output_file)

Parses VCF files, adds the parsed VCF features and tracks to the cgview map JSON data structure, and writes the cgview map JSON data structure to a new file.

Parameters:
  • vcf_files (List[str]) – A list of paths to VCF files.

  • json_file (str) – The path to a cgview map JSON file.

  • output_file (str) – The path to the output file.

Return type:

None

proksee_batch.parse_additional_features.get_feature_locations_and_scores_from_bed_features(bed_features)

Gets feature locations and scores from BED features.

Parameters:

bed_features (List[BedFeatureDict]) – A list of parsed BED features.

Returns:

A list of tuples containing feature locations and scores.

Return type:

List[Tuple[int, int, float]]
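The return type suggests one (start, stop, score) tuple per feature. A minimal sketch of that extraction follows; the dictionary keys "start", "stop", and "score" are assumptions, since the fields of BedFeatureDict are not listed here.

```python
from typing import Any, Dict, List, Tuple


def locations_and_scores(features: List[Dict[str, Any]]) -> List[Tuple[int, int, float]]:
    """Return one (start, stop, score) tuple per feature dictionary."""
    # The key names are illustrative; adapt them to the real feature dicts.
    return [(int(f["start"]), int(f["stop"]), float(f["score"])) for f in features]
```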

proksee_batch.parse_additional_features.get_feature_locations_and_scores_from_blast_features(blast_features)

Gets feature locations and scores from BLAST features.

Parameters:

blast_features (List[BlastFeatureDict]) – A list of parsed BLAST features.

Returns:

A list of tuples containing feature locations and scores.

Return type:

List[Tuple[int, int, float]]

proksee_batch.parse_additional_features.get_feature_locations_and_scores_from_gff_features(gff_features)

Gets feature locations and scores from GFF features.

Parameters:

gff_features (List[GffFeatureDict]) – A list of parsed GFF features.

Returns:

A list of tuples containing feature locations and scores.

Return type:

List[Tuple[int, int, float]]

proksee_batch.parse_additional_features.get_feature_locations_and_scores_from_vcf_features(vcf_features)

Gets feature locations and scores from VCF features.

Parameters:

vcf_features (List[VcfFeatureDict]) – A list of parsed VCF features.

Returns:

A list of tuples containing feature locations and scores.

Return type:

List[Tuple[int, int, float]]

proksee_batch.parse_additional_features.parse_bed_files(bed_files)

Parses BED files.

Parameters:

bed_files (List[str]) – A list of paths to BED files.

Returns:

A tuple containing a list of parsed BED features and a list of parsed BED tracks.

Return type:

Tuple[List[BedFeatureDict], List[TrackDict]]
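To make the parsing step concrete, the sketch below turns BED text into a list of feature dictionaries. It follows the standard BED column order (chrom, start, end, name, score, strand), but the function name parse_bed_text, the output key names, and the conversion to 1-based coordinates are assumptions rather than the behaviour of parse_bed_files; track construction is omitted.

```python
from typing import Any, Dict, List


def parse_bed_text(bed_text: str) -> List[Dict[str, Any]]:
    """Parse BED lines into dictionaries (key names are illustrative)."""
    features = []
    for line in bed_text.splitlines():
        # Skip blank lines and the optional header lines BED permits.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"BED line needs at least 3 columns: {line!r}")
        has_score = len(fields) > 4 and fields[4] != "."
        features.append({
            "contig": fields[0],
            # BED starts are 0-based and ends are exclusive; adding 1 to the
            # start makes both coordinates 1-based and inclusive.
            "start": int(fields[1]) + 1,
            "stop": int(fields[2]),
            "name": fields[3] if len(fields) > 3 else "",
            "score": float(fields[4]) if has_score else 0.0,
            "strand": fields[5] if len(fields) > 5 else ".",
        })
    return features
```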

proksee_batch.parse_additional_features.parse_blast_files(blast_files)

Parses BLAST result files.

Parameters:

blast_files (List[str]) – A list of paths to BLAST result files.

Returns:

A tuple containing a list of parsed BLAST features and a list of parsed BLAST tracks.

Return type:

Tuple[List[BlastFeatureDict], List[TrackDict]]

proksee_batch.parse_additional_features.parse_gff_files(gff_files)

Parses GFF files.

Parameters:

gff_files (List[str]) – A list of paths to GFF files.

Returns:

A tuple containing a list of parsed GFF features and a list of parsed GFF tracks.

Return type:

Tuple[List[GffFeatureDict], List[TrackDict]]

proksee_batch.parse_additional_features.parse_vcf_files(vcf_files)

Parses VCF files.

Parameters:

vcf_files (List[str]) – A list of paths to VCF files.

Returns:

A tuple containing a list of parsed VCF features and a list of parsed VCF tracks.

Return type:

Tuple[List[VcfFeatureDict], List[TrackDict]]

proksee_batch.seq_file_to_cgview_json

proksee_batch.seq_file_to_cgview_json.fasta_to_cgview_json(genome_name, fasta_file, json_file)

Convert a FASTA file to a CGView JSON file. The JSON file will be in the same format as generated by the genbank_to_cgview_json function. There will be no features in the JSON file, only sequences/contigs.

Parameters:
  • genome_name (str)

  • fasta_file (str)

  • json_file (str)

Return type:

None

proksee_batch.seq_file_to_cgview_json.genbank_to_cgview_json(genome_name, genbank_file, json_file)

Convert a GenBank file to a CGView JSON file.

Parameters:
  • genome_name (str)

  • genbank_file (str)

  • json_file (str)

Return type:

None

proksee_batch.seq_file_to_cgview_json.remove_problematic_characters_from_contig_name(contig_name)

Remove problematic characters from a contig name. This is necessary because some downstream software may not be able to handle contig names that contain certain characters.

Parameters:

contig_name (str)

Return type:

str
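Which characters the function treats as problematic is not stated here. As an illustration only, the sketch below keeps letters, digits, underscores, dots, and hyphens and replaces everything else with an underscore; both the allowed set and the name sanitize_contig_name are assumptions.

```python
import re


def sanitize_contig_name(contig_name: str) -> str:
    """Replace every character outside [A-Za-z0-9_.-] with an underscore."""
    return re.sub(r"[^A-Za-z0-9_.\-]", "_", contig_name)
```

A replacement (rather than deletion) keeps the name the same length, but two different inputs can still map to the same output, so callers that need unique names should check for collisions.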

proksee_batch.seq_file_to_cgview_json.seq_to_json_contig(seq_id, seq)

Convert a sequence to a dictionary with the sequence ID, sequence length, and sequence. The dictionary will be in the format expected for a sequence in a CGView JSON file.

Parameters:
  • seq_id (str)

  • seq (str)

Return type:

Dict[str, Any]
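A minimal sketch of the dictionary described above. The key names "name", "length", and "seq" are assumptions about the contig layout in a CGView JSON file, and contig_record is a hypothetical stand-in for seq_to_json_contig.

```python
from typing import Any, Dict


def contig_record(seq_id: str, seq: str) -> Dict[str, Any]:
    """Bundle a sequence ID, its length, and the sequence itself."""
    return {"name": seq_id, "length": len(seq), "seq": seq}
```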

proksee_batch.validate_input_data

The input directory must be structured as in the following example:

input_directory/
    genome_name_1/
        genbank/
            genome1.gbk
        fasta/
            genome1.fna
        blast/
            abc.txt
            def.tsv
        bed/
            ghi.bed
            jkl.bed
        json/
            template1.json
        vcf/
            mno.vcf
            pqr.vcf
        gff/
            stu.gff
            vwx.gff3
    genome_name_2/
        genbank/
            genome2.gbff
        fasta/
            genome2.fa
        blast/
            yza.txt
            bcd.tsv
        bed/
            efg.bed
            hij.bed
        json/
            template2.json
        vcf/
            klm.vcf
            nop.vcf
        gff/
            qrs.gff
            tuv.gff3

The genbank directory must contain a single GenBank file with the extension .gbk, .gbff, or .gb. This is the genome that will be visualized. If the genbank directory is not present, then proksee-batch will use a file from the fasta directory instead (otherwise the fasta directory is ignored). The blast, bed, vcf, and gff directories are optional. They contain files with additional genomic features. The json directory is also optional. It contains a custom Proksee project JSON file that will be used as a template for the visualization.
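The selection rule in the paragraph above (use the single GenBank file, and fall back to the fasta directory only when the genbank directory is absent) can be sketched as follows. The function name find_genome_file, the choice to raise ValueError, and taking the first file from the fasta directory are assumptions; the real validation code may behave differently on malformed input.

```python
from pathlib import Path
from typing import Optional

GENBANK_EXTENSIONS = {".gbk", ".gbff", ".gb"}


def find_genome_file(genome_dir: str) -> Optional[Path]:
    """Pick the sequence file for one genome subdirectory.

    Prefers the single GenBank file in genbank/ and falls back to a file
    in fasta/ only when genbank/ is absent. Returns None if neither exists.
    """
    base = Path(genome_dir)
    genbank_dir = base / "genbank"
    if genbank_dir.is_dir():
        candidates = sorted(
            p for p in genbank_dir.iterdir()
            if p.is_file() and p.suffix in GENBANK_EXTENSIONS
        )
        if len(candidates) != 1:
            raise ValueError(
                f"{genbank_dir} must contain exactly one GenBank file, "
                f"found {len(candidates)}"
            )
        return candidates[0]
    fasta_dir = base / "fasta"
    if fasta_dir.is_dir():
        files = sorted(p for p in fasta_dir.iterdir() if p.is_file())
        return files[0] if files else None
    return None
```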

proksee_batch.validate_input_data.check_vcf_ref_vs_alt_genotypes(vcf_file_path, genome_file_path, genome_file_type)

Checks if the genotypes in the genome in the GenBank file match the REF genotypes in the VCF file.

Parameters:
  • vcf_file_path (str) – The path to the VCF file.

  • genome_file_path (str) – The path to the genome file (described in the docstring as the GenBank file).

  • genome_file_type (str)

Returns:

True if the genotypes in the genome in the GenBank file match the REF genotypes in the VCF file, False otherwise.

Return type:

bool
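The core of such a check is comparing each REF allele with the genome sequence at the stated position. The sketch below works on in-memory data to stay self-contained: ref_alleles_match is a hypothetical name, and passing the contigs as a dictionary of ID to sequence is an assumption (the documented function takes file paths instead).

```python
from typing import Dict


def ref_alleles_match(vcf_text: str, contigs: Dict[str, str]) -> bool:
    """Return True if every REF allele equals the sequence at CHROM:POS."""
    for line in vcf_text.splitlines():
        # Skip blank lines, meta lines (##) and the column header (#CHROM).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        seq = contigs.get(chrom)
        if seq is None:
            return False
        # VCF POS is 1-based; convert to a 0-based slice start.
        start = pos - 1
        if seq[start:start + len(ref)].upper() != ref.upper():
            return False
    return True
```

A mismatch usually means the VCF was called against a different reference, or that REF and ALT are swapped relative to the genome being visualized.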

proksee_batch.validate_input_data.check_vcf_seq_ids(vcf_file_path, seq_file_path, seq_file_format)

Checks if all the sequence IDs in the first column of the VCF file are contigs in a GenBank or FASTA file.

Parameters:
  • vcf_file_path (str) – The path to the VCF file.

  • seq_file_path (str) – The path to the GenBank or FASTA file.

  • seq_file_format (str) – The format of the GenBank or FASTA file. Valid values are “genbank” and “fasta”.

Returns:

True if all sequence IDs in the VCF file are contigs in the sequence file, False otherwise.

Return type:

bool
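The check described above reduces to a subset test between the IDs in the first VCF column and the contig IDs of the sequence file. A minimal sketch, assuming the contig IDs have already been read from the GenBank or FASTA file; vcf_seq_ids and all_vcf_ids_known are hypothetical names.

```python
from typing import Iterable, Set


def vcf_seq_ids(vcf_text: str) -> Set[str]:
    """Collect the CHROM values (first column) of all VCF data lines."""
    ids = set()
    for line in vcf_text.splitlines():
        # Lines starting with '#' are meta or header lines, not records.
        if not line.strip() or line.startswith("#"):
            continue
        ids.add(line.split("\t", 1)[0])
    return ids


def all_vcf_ids_known(vcf_text: str, contig_ids: Iterable[str]) -> bool:
    """Return True if every VCF sequence ID is one of `contig_ids`."""
    return vcf_seq_ids(vcf_text) <= set(contig_ids)
```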

proksee_batch.validate_input_data.get_data_files(input_subdir, data_type)

Returns the paths to the data files of the specified type in the provided subdirectory.

Parameters:
  • input_subdir (str) – The path to the subdirectory containing the data files.

  • data_type (str) – The type of the data files to be returned. Valid values are “genbank”, “blast”, “bed”, “json”, “vcf”, and “gff”.

Returns:

The paths to the data files.

Return type:

list
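Given the input layout, where each data type has its own subdirectory, a lookup of this kind can be sketched as a directory listing. list_data_files is a hypothetical name, and the sketch does no filtering by file extension, which the real get_data_files may well do.

```python
from pathlib import Path
from typing import List


def list_data_files(input_subdir: str, data_type: str) -> List[str]:
    """Return sorted paths of regular files in <input_subdir>/<data_type>/."""
    data_dir = Path(input_subdir) / data_type
    if not data_dir.is_dir():
        # Optional directories (for example bed/ or vcf/) may be missing.
        return []
    return sorted(str(p) for p in data_dir.iterdir() if p.is_file())
```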

proksee_batch.validate_input_data.handle_error_exit(error_message, exit_code=1)

Handles errors by printing a message to sys.stderr and exiting the program.

Parameters:
  • error_message (str) – The error message to be printed.

  • exit_code (int) – The exit code to be used for sys.exit. Defaults to 1.

Return type:

None

Exits:

SystemExit: Exits the program with the provided exit code.

proksee_batch.validate_input_data.validate_input_directory_contents(input)

Validates if the provided input directory contains the required subdirectories and files.

Parameters:

input (str) – The path to the input directory to be checked.

Raises:

SystemExit – If the input directory does not contain the required subdirectories.

Return type:

None