
Note: 48

Pyoverdine BGC Analysis

This dialog walks through identifying and characterizing a pyoverdine biosynthetic gene cluster from a bacterial genome annotation.

Note: 10

1. Load annotation data

Code: 6 ()

import pandas as pd

Code: 172 ()

# Read the annotation table (tab-separated, no header row; '#' lines are comments)
columns = ['Sequence Id', 'Type', 'Start', 'Stop', 'Strand', 'Locus Tag', 'Gene', 'Product', 'DbXrefs']
annot = pd.read_csv(
    '/Users/johno/Downloads/XT4K5N_results/XT4K5N_1_sample_1/annotation/XT4K5N_1_sample_1_1_reference.tsv',
    sep='\t', comment='#', header=None, names=columns,
)
print(f"Total features annotated: {len(annot)}")
annot.head()

Output: 1,270

Total features annotated: 5952
Sequence Id Type Start Stop Strand Locus Tag Gene Product DbXrefs
0 contig_1 cds 7 1545 + LDMLEE_00001 dnaA chromosomal replication initiator protein DnaA GO:0003677, GO:0003688, GO:0005524, GO:0005737...
1 contig_1 cds 1586 2689 + LDMLEE_00002 dnaN DNA polymerase III subunit beta EC:2.7.7.7, GO:0003677, GO:0003887, GO:0006260...
2 contig_1 cds 2705 3808 + LDMLEE_00003 recF DNA replication/repair protein RecF COG:COG1195, COG:L, GO:0000731, GO:0003697, GO...
3 contig_1 cds 3813 6233 + LDMLEE_00004 gyrB DNA topoisomerase (ATP-hydrolyzing) subunit B EC:5.6.2.2, GO:0003677, GO:0003918, GO:0005524...
4 contig_1 cds 6557 7393 + LDMLEE_00005 dinD DNA damage-inducible protein D KEGG:K14623, SO:0001217, UniRef:UniRef50_N8WDK...

Note: 18

2. Search for pyoverdine-related genes

Code: 187 ()

# Search gene names and product descriptions for pyoverdine-related terms.
# Note: 'pta' is meant to catch ptaA, but it also matches unrelated genes such as pta.
pvd_mask = annot['Gene'].str.contains('pvd|pta', case=False, na=False)
nrps_mask = annot['Product'].str.contains('NRPS|siderophore|pyoverdine|non-ribosomal', case=False, na=False)

# Keep rows that match either mask and show them in genomic order
pvd_genes = annot[pvd_mask | nrps_mask][['Start', 'Stop', 'Strand', 'Gene', 'Product']]
print(f"Found {len(pvd_genes)} potential pyoverdine-related genes")
pvd_genes.sort_values('Start')

Note: 96

Key findings: pvdA at ~3.99 Mb, pvdT (exporter) at ~4.71 Mb, and large NRPS genes at ~4.73–4.75 Mb. The main cluster therefore spans roughly 4.71–4.75 Mb, with pvdA well outside it.
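
One way to make the "main cluster" call less of an eyeball judgment is to group hits whose start coordinates lie within some maximum gap of each other. This is a minimal sketch only: the coordinates are placeholders rounded from the summary above (not the real table values), and both the `group_by_proximity` helper and the 50 kb gap are assumptions chosen for illustration.

```python
def group_by_proximity(starts, max_gap=50_000):
    """Group start coordinates into loci separated by more than max_gap bp."""
    loci = []
    for pos in sorted(starts):
        if loci and pos - loci[-1][-1] <= max_gap:
            loci[-1].append(pos)   # close enough: extend the current locus
        else:
            loci.append([pos])     # too far away: start a new locus
    return loci

# Placeholder coordinates rounded from the notes above (hypothetical values)
hit_starts = [3_990_000, 4_710_000, 4_730_000, 4_750_000]
loci = group_by_proximity(hit_starts)
for locus in loci:
    print(f"{len(locus)} hit(s): {locus[0]:,} - {locus[-1]:,}")
```

With the real table, the same function could be fed `pvd_genes['Start']` directly; the gap threshold would need tuning to the genome.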

Note: 15

3. Examine the BGC cluster region

Code: 76 ()

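# Restrict to contig_1 so coordinates from other contigs cannot leak into the window
on_contig = annot['Sequence Id'] == 'contig_1'
cluster_region = annot[on_contig & (annot['Start'] >= 4710000) & (annot['Stop'] <= 4760000)]
cluster_region[['Start', 'Stop', 'Strand', 'Gene', 'Product']].sort_values('Start')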

Note: 136

This cluster contains: pvdT (exporter), a PvdJ/PvdD/PvdP-like protein, a SidA/IucD/PvdA-like monooxygenase, a TonB-dependent receptor (likely fpvA), two large NRPS genes (~14.7 kb and ~6.4 kb), and a gene annotated as argD (an aminotransferase that plausibly supplies the diaminobutyrate used in the chromophore).

Note: 12

4. antiSMASH analysis

Code: 243 ()

# Load antiSMASH results
import json

with open('/Users/johno/Downloads/XT4K5N_1_sample_1_antismash/XT4K5N_1_sample_1_1_reference.json') as f:
    antismash = json.load(f)

# Find the NRP-metallophore region (our pyoverdine cluster)
for rec in antismash.get('records', [antismash]):
    for region in rec.get('areas', rec.get('regions', [])):
        if 'NRP-metallophore' in str(region.get('products', [])):
            print(f"Region: {region['start']:,} - {region['end']:,}")
            print(f"Products: {region['products']}")

Output: 66

Region: 4,696,705 - 4,796,912
Products: ['NRP-metallophore', 'NRPS']

Note: 64

antiSMASH identifies region 10 (4,696,705–4,796,912) as NRP-metallophore + NRPS — this is the pyoverdine BGC.
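
As a quick sanity check, the manually chosen window from section 3 should fall inside the antiSMASH region. The sketch below uses only the coordinates already shown in this dialog; the `contains` helper is a hypothetical name introduced for illustration.

```python
def contains(outer, inner):
    """True if the inner (start, end) interval lies entirely within outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

antismash_region = (4_696_705, 4_796_912)   # from the antiSMASH output above
manual_window = (4_710_000, 4_760_000)      # window used in section 3

region_span = antismash_region[1] - antismash_region[0]
print(f"antiSMASH region span: {region_span:,} bp")
print("Manual window inside antiSMASH region:", contains(antismash_region, manual_window))
```

The antiSMASH region is wider than the manual window on both sides, so genes flanking the 4.71–4.76 Mb window may also belong to the cluster.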

Note: 88 ()

pasted_image

Note: 39 ()

NB: the SolG annotation is probably incorrect. Given the surrounding genes, it is a fair bet that this is a pyoverdine NRPS.

Note: 13

5. Extract A-domain substrate predictions

Code: 172 ()

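# Parse A-domain predictions from the region 10 GBK
import re

with open('/Users/johno/Downloads/XT4K5N_1_sample_1_antismash/contig_1.region010.gbk') as f:
    gbk = f.read()

# Grab the first /specificity qualifier that follows each "AMP-binding" mention.
# The match is loose, so E-domain activity qualifiers are captured as well;
# the "Module" labels printed below are list indices, not true module numbers.
a_domains = re.findall(r'AMP-binding.*?/specificity="([^"]+)"', gbk, re.DOTALL)
for i, spec in enumerate(a_domains, 1):
    print(f"Module {i}: {spec}")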

Output: 156

Module 1: substrate consensus: OH-Orn
Module 2: substrate consensus: X
Module 3: E activity: active
Module 4: substrate consensus: OH-Orn
Module 5: substrate consensus: Asp
Module 6: substrate consensus: Ala
Module 7: substrate consensus: Asp
Module 8: substrate consensus: Dab
Module 9: E activity: active
Module 10: substrate consensus: Tyr
Module 11: substrate consensus: Glu

Note: 295

Reading modules in biosynthetic order (11→1), the predicted peptide is:

Module  Substrate          Notes
------  -----------------  ---------------------
11      Glu                Chromophore precursor
10      Tyr                Chromophore precursor
9       (E domain active)  D-Tyr
8       Dab                Chromophore precursor
7       Asp                Variable peptide
6       Ala
5       Asp
4       OH-Orn             Hydroxyornithine
3       (E domain active)  D-OHOrn
2       X                  Unknown
1       OH-Orn

The Glu-Tyr-Dab is the conserved chromophore-forming tripeptide.
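
The reading of the table above can be written down as a small transformation. This is a sketch under two assumptions taken from the table itself: biosynthetic order is the reverse of the listing order, and an active E domain epimerizes the residue loaded immediately before it. The `to_peptide` helper is a hypothetical name.

```python
def to_peptide(entries):
    """Turn listing-order specificity strings into residues in biosynthetic order."""
    peptide = []
    for entry in reversed(entries):
        label, value = (part.strip() for part in entry.split(":", 1))
        if label == "E activity":
            # An active epimerization domain flips the preceding residue to D
            if value == "active" and peptide:
                peptide[-1] = "D-" + peptide[-1]
        else:
            peptide.append(value)
    return peptide

# The eleven strings printed by the cell above, in listing order
entries = [
    "substrate consensus: OH-Orn", "substrate consensus: X", "E activity: active",
    "substrate consensus: OH-Orn", "substrate consensus: Asp", "substrate consensus: Ala",
    "substrate consensus: Asp", "substrate consensus: Dab", "E activity: active",
    "substrate consensus: Tyr", "substrate consensus: Glu",
]
peptide = to_peptide(entries)
print(" - ".join(peptide))
```

The eleven entries collapse to nine residues, because the two E-domain entries modify a neighbour rather than adding a residue of their own.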

Note: 84 ()

The Minowa method predicts the unknown module 2 as "NH2", suggesting an amino acid with a free amine (Lys, Orn, or Dab). Given the context, Lys is most likely. (That is Opus's suggestion; see the later update.)

Note: 25

6. Determine side-chain type (pvdN vs ptaA)

Note: 276

The side-chain type is determined by which gene is present:

  • pvdN (PLP-dependent aminotransferase) → succinamide side chain
  • ptaA (periplasmic PLP-dependent transaminase) → α-ketoglutarate side chain

BLAST the genome proteins against reference sequences:

  • PvdN (P. aeruginosa PAO1): NP_251095.1
  • PtaA (P. protegens Pf-5): AAY94407.1

Use NCBI BLAST "Align two or more sequences" with the protein FASTA: /Users/johno/Downloads/XT4K5N_results/XT4K5N_1_sample_1/annotation/XT4K5N_1_sample_1_1_reference.faa

Note: 190

BLAST results:

Query               Best hit      Identity  E-value  Conclusion
------------------  ------------  --------  -------  --------------------
PvdN (NP_251095.1)  LDMLEE_03614  83%       6e-53    pvdN present
PtaA (AAY94407.1)   LDMLEE_00269  23%       7e-09    ✗ No significant hit

pvdN is present (83% identity) and ptaA has no convincing homolog, so the strain is predicted to produce the succinamide side chain.
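
The decision rule behind that conclusion can be made explicit. In this sketch the identity and E-value cut-offs are arbitrary assumptions chosen for illustration, not established thresholds, and `is_present` is a hypothetical helper; only the two BLAST results come from the table above.

```python
def is_present(identity_pct, evalue, min_identity=40.0, max_evalue=1e-20):
    """Call a homolog present only if both (assumed) cut-offs are met."""
    return identity_pct >= min_identity and evalue <= max_evalue

# (percent identity, E-value) of the best hit for each reference protein
hits = {"PvdN": (83.0, 6e-53), "PtaA": (23.0, 7e-09)}
calls = {name: is_present(*values) for name, values in hits.items()}

if calls["PvdN"] and not calls["PtaA"]:
    side_chain = "succinamide"
elif calls["PtaA"] and not calls["PvdN"]:
    side_chain = "alpha-ketoglutarate"
else:
    side_chain = "ambiguous"

print(calls)
print("Predicted side chain:", side_chain)
```

Requiring both criteria keeps a weak hit such as the 23% PtaA match from counting as evidence that the gene is present.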

Note: 252 ()

Predicted Structure

Predicted pyoverdine structure:

  • Chromophore: Glu → D-Tyr → Dab
  • Variable peptide: Asp → Ala → Asp → D-OHOrn → Lys? → OHOrn
  • Side chain: Succinamide (pvdN present at 83% identity; ptaA absent)

The BGC spans two loci: biosynthetic genes (pvdA, pvdN) at ~3.99 Mb and the main NRPS cluster at ~4.71–4.75 Mb. antiSMASH classifies the main cluster as an NRP-metallophore cluster. The unknown residue at position 8 of the peptide (module 2 in the A-domain listing) is predicted by Opus to be Lys, based on the Minowa "NH2" annotation. LC-MS confirmation is recommended.
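
For the LC-MS follow-up, it helps to know how far apart the candidate structures would be in mass. The sketch below compares only the two residues raised in this dialog for the unknown position (Lys here, Gly in the later update) using standard monoisotopic atomic masses. It ignores charge state, adducts, and the rest of the molecule, so it gives a mass difference, not a predicted m/z.

```python
# Monoisotopic atomic masses in Da
MONO = {"C": 12.0, "H": 1.00782503207, "N": 14.0030740048, "O": 15.9949146196}

def residue_mass(formula):
    """Monoisotopic mass of a residue given as {element: count}."""
    return sum(MONO[element] * count for element, count in formula.items())

gly = residue_mass({"C": 2, "H": 3, "N": 1, "O": 1})    # glycyl residue, C2H3NO
lys = residue_mass({"C": 6, "H": 12, "N": 2, "O": 1})   # lysyl residue, C6H12N2O
delta = lys - gly

print(f"Gly residue: {gly:.4f} Da")
print(f"Lys residue: {lys:.4f} Da")
print(f"Difference:  {delta:.4f} Da")
```

A gap of roughly 71 Da between the two candidates should be easy to distinguish on most instruments.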

Note: 6 ()

Diagram and Updates

Output: 824

Prompt: 97 ()

More deep research turned up the idea that the mystery X is probably Gly. Gemini made the diagram code above (after I iterated on a worse-looking version for a bit) but made an (unfounded IMO) decision to remove the extra starting N. What do you think? Any revisions to the diagram to make?

Assistant: 553

Looking at the SMILES, Gemini's decision is actually correct. The first N in N[C@@H](CCC(=O)O)C(=O) is the N-terminus of L-Glu — that's the α-amino group. Having an extra N before that would be chemically wrong (you'd have a hydrazine or similar).

However, I do see one issue worth fixing:

The hydroxylamine groups are written as CCCN(O), which puts the oxygen in a redundant branch. SMILES fills the remaining valences with implicit hydrogens, so this parses to the same N5-hydroxyornithine side chain as CCCNO (hydroxylamine: -NH-OH) and RDKit renders it fine, but CCCNO is the cleaner way to write it.

Here's the fix:

# Change these two lines:
"N[C@H](CCCN(O))C(=O)"         # D-OHOrn (atoms 50-58)
"N[C@@H](CCCN(O))C(=O)O"       # L-OHOrn (atoms 63-72)

# To:
"N[C@H](CCCNO)C(=O)"           # D-OHOrn 
"N[C@@H](CCCNO)C(=O)O"         # L-OHOrn

Also, if the mystery residue is confirmed as Gly (not Lys as speculated earlier), the current diagram and legend already show "Gly (predicted)", which is correct. You might want to update the annotation text to reflect the Gly interpretation of the Minowa "NH2" prediction, rather than the earlier Lys guess.

Want me to make these changes to the code cell?

Note: 123 ()

Note from the author: I'm out of my depth enough to let the AIs agree to disagree on the pretty picture front :) I'm hoping to do some LC-MS which should let us come back and resolve things with more certainty. For now, I'd say take the exact structure with a grain of salt BUT it looks like we do definitely have a novel pyoverdine on our hands.

  • johnowhitaker

Code: 13 ()

# !uv pip install py3Dmol

Code: 400 ()

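import py3Dmol
from rdkit import Chem
from rdkit.Chem import AllChem

# Linear peptide: Glu, D-Tyr, Dab, Asp, Ala, Asp, D-OHOrn, Gly (predicted), OHOrn
smiles = "N[C@@H](CCC(=O)O)C(=O)N[C@H](Cc1ccc(O)cc1)C(=O)N[C@@H](CCN)C(=O)N[C@@H](CC(=O)O)C(=O)N[C@@H](C)C(=O)N[C@@H](CC(=O)O)C(=O)N[C@H](CCCNO)C(=O)NCC(=O)N[C@@H](CCCNO)C(=O)O"

mol = Chem.MolFromSmiles(smiles)
mol = Chem.AddHs(mol)  # explicit hydrogens are needed for a sensible 3D geometry
# EmbedMolecule returns -1 if no conformer could be generated
if AllChem.EmbedMolecule(mol, randomSeed=42) == -1:
    raise RuntimeError("3D embedding failed")
AllChem.MMFFOptimizeMolecule(mol)  # relax the geometry with the MMFF force field

mb = Chem.MolToMolBlock(mol)

# Render the conformer as sticks in the notebook
view = py3Dmol.view(width=400, height=300)
view.addModel(mb, 'mol')
view.setStyle({'stick': {'colorscheme': 'greenCarbon', 'radius': 0.15}})
view.setBackgroundColor('white')
view.zoomTo()
view.show()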
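
The SMILES string above is a plain N-to-C concatenation of per-residue fragments, so it can be assembled from a residue list. The sketch below does that, which makes it easy to swap the predicted Gly for the earlier Lys guess. The `RESIDUES` table and `peptide_smiles` helper are hypothetical names, the Lys fragment is an addition not present in the original string, and the result is still the linear peptide rather than a mature cyclized pyoverdine.

```python
# Residue fragments written N-terminus first, each ending in its backbone C(=O)
RESIDUES = {
    "Glu":     "N[C@@H](CCC(=O)O)C(=O)",
    "D-Tyr":   "N[C@H](Cc1ccc(O)cc1)C(=O)",
    "Dab":     "N[C@@H](CCN)C(=O)",
    "Asp":     "N[C@@H](CC(=O)O)C(=O)",
    "Ala":     "N[C@@H](C)C(=O)",
    "D-OHOrn": "N[C@H](CCCNO)C(=O)",
    "Gly":     "NCC(=O)",
    "Lys":     "N[C@@H](CCCCN)C(=O)",   # alternative for the unknown position
    "OHOrn":   "N[C@@H](CCCNO)C(=O)",
}

def peptide_smiles(residues):
    """Concatenate residue fragments and cap the C-terminus as a free acid."""
    return "".join(RESIDUES[name] for name in residues) + "O"

sequence = ["Glu", "D-Tyr", "Dab", "Asp", "Ala", "Asp", "D-OHOrn", "Gly", "OHOrn"]
gly_variant = peptide_smiles(sequence)
lys_variant = peptide_smiles(["Lys" if name == "Gly" else name for name in sequence])

print(gly_variant)
print(lys_variant)
```

Either string can be passed to `Chem.MolFromSmiles` in place of `smiles` to view the corresponding variant.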
