Note: 48

Pyoverdine BGC Analysis

This dialog walks through identifying and characterizing a pyoverdine biosynthetic gene cluster from a bacterial genome annotation.

Note: 10

1. Load annotation data

Code: 6 ()

import pandas as pd

Code: 172 ()

# Read in the annotation ('#' lines are comments; the file has no header row, so column names are supplied)
cols = ['Sequence Id', 'Type', 'Start', 'Stop', 'Strand', 'Locus Tag', 'Gene', 'Product', 'DbXrefs']
annot = pd.read_csv('/Users/johno/Downloads/XT4K5N_results/XT4K5N_1_sample_1/annotation/XT4K5N_1_sample_1_1_reference.tsv', sep='\t', comment='#', header=None, names=cols)
print(f"Total features annotated: {len(annot)}")
annot.head()

Output: 1,270

Total features annotated: 5952

	Sequence Id	Type	Start	Stop	Strand	Locus Tag	Gene	Product	DbXrefs
0	contig_1	cds	7	1545	+	LDMLEE_00001	dnaA	chromosomal replication initiator protein DnaA	GO:0003677, GO:0003688, GO:0005524, GO:0005737...
1	contig_1	cds	1586	2689	+	LDMLEE_00002	dnaN	DNA polymerase III subunit beta	EC:2.7.7.7, GO:0003677, GO:0003887, GO:0006260...
2	contig_1	cds	2705	3808	+	LDMLEE_00003	recF	DNA replication/repair protein RecF	COG:COG1195, COG:L, GO:0000731, GO:0003697, GO...
3	contig_1	cds	3813	6233	+	LDMLEE_00004	gyrB	DNA topoisomerase (ATP-hydrolyzing) subunit B	EC:5.6.2.2, GO:0003677, GO:0003918, GO:0005524...
4	contig_1	cds	6557	7393	+	LDMLEE_00005	dinD	DNA damage-inducible protein D	KEGG:K14623, SO:0001217, UniRef:UniRef50_N8WDK...

Note: 18

2. Search for pyoverdine-related genes

Code: 187 ()

# Search for pyoverdine-related genes
pvd_mask = annot['Gene'].str.contains('pvd|pta', case=False, na=False)
nrps_mask = annot['Product'].str.contains('NRPS|siderophore|pyoverdine|non-ribosomal', case=False, na=False)

pvd_genes = annot[pvd_mask | nrps_mask][['Start', 'Stop', 'Strand', 'Gene', 'Product']]
print(f"Found {len(pvd_genes)} potential pyoverdine-related genes")
pvd_genes.sort_values('Start')

Output: 3,577

Found 33 potential pyoverdine-related genes

	Start	Stop	Strand	Gene	Product
103	104161	105819	-	nptA	PhoU domain-containing protein
262	302870	304945	-	NaN	TonB-dependent siderophore receptor
268	312380	314677	+	fhuE	TonB-dependent siderophore receptor
272	316413	318929	-	NaN	TonB-dependent siderophore receptor
361	417833	420265	-	NaN	TonB-dependent siderophore receptor
562	611855	614032	+	fhuE	Ferripyoverdine receptor
689	741595	744024	-	NaN	TonB-dependent siderophore receptor
749	808069	810498	+	NaN	TonB-dependent siderophore receptor
775	834621	835178	+	hptA	hypoxanthine-guanine phosphoribosyltransferase
801	860991	863078	-	pta	phosphate acetyltransferase
850	948335	950860	+	NaN	TonB-dependent siderophore receptor
892	989285	991576	-	fiu	TonB-dependent siderophore receptor
898	996667	999096	+	fecA	TonB-dependent siderophore receptor
950	1054211	1054735	-	lptA	lipopolysaccharide transport periplasmic prote...
1708	1871826	1874303	+	NaN	TonB-dependent siderophore receptor
1727	1893842	1895953	-	NaN	TonB-dependent siderophore receptor
1746	1911335	1913770	+	cirA	TonB-dependent siderophore receptor
2009	2213982	2216111	-	fhuE	Outer membrane ferric siderophore receptor
2060	2269455	2271884	+	NaN	TonB-dependent siderophore receptor
2104	2340708	2343158	-	NaN	Ferripyoverdine receptor
2168	2420925	2423168	-	fiu	Outer membrane ferric siderophore receptor
2169	2423286	2424275	-	ceuA	Siderophore ABC transporter substrate-binding ...
2170	2424272	2425048	-	ceuD	Ferric siderophore ABC transporter, ATP-bindin...
2171	2425045	2425986	-	ceuC	Ferric siderophore ABC transporter, permease p...
2252	2523109	2525544	-	NaN	Outer membrane ferripyoverdine receptor
3384	3715001	3717406	-	fiu	TonB-dependent siderophore receptor
3631	3994194	3995528	+	pvdA	L-ornithine N5-oxygenase PvdA
4306	4713317	4715284	+	pvdT	Pyoverdine export ATP-binding/permease protein...
4313	4724037	4726460	-	NaN	TonB-dependent siderophore receptor
4315	4727582	4742314	-	solG	solanimycin non-ribosomal peptide synthetase SolG
4316	4742480	4748890	-	NaN	Non-ribosomal peptide synthetase
4317	4749066	4750019	+	NaN	Siderophore interacting FAD binding protein
5711	6278694	6281057	+	pvdQ	Acyl-homoserine lactone acylase QuiP

Note: 96

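Key findings: pvdA at ~3.99 Mb, pvdT (exporter) at ~4.71 Mb, and two large NRPS genes at ~4.73–4.75 Mb. The main cluster sits at roughly 4.71–4.75 Mb. The nptA, hptA, lptA and pta rows are false positives: they match only because their gene names contain the substring "pta", and their products are unrelated to pyoverdine.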
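The gene-name filter above is a substring match, which is why names such as nptA and lptA come through. A minimal sketch of a stricter, anchored pattern is shown here on gene names copied from the output; the choice to accept any name starting with "pvd" plus exactly "ptaA" is an assumption for illustration, not part of the original analysis.

```python
import re

# Gene names copied from the search output above (rows with no gene name omitted).
genes = ["nptA", "fhuE", "hptA", "pta", "fiu", "fecA", "lptA", "cirA",
         "ceuA", "ceuD", "ceuC", "pvdA", "pvdT", "solG", "pvdQ"]

# Substring match, as used in the search cell.
loose = re.compile(r"pvd|pta", re.IGNORECASE)
# Anchored match (assumed rule): names starting with "pvd", or exactly "ptaA".
strict = re.compile(r"^(?:pvd\w*|ptaA)$", re.IGNORECASE)

loose_hits = [g for g in genes if loose.search(g)]
strict_hits = [g for g in genes if strict.search(g)]

print("loose: ", loose_hits)
print("strict:", strict_hits)
```

The same anchored pattern could be passed to `annot['Gene'].str.contains(...)`; the non-capturing group `(?:...)` avoids the pandas warning about match groups.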

Note: 15

3. Examine the BGC cluster region

Code: 76 ()

cluster_region = annot[(annot['Start'] >= 4710000) & (annot['Stop'] <= 4760000)]
cluster_region[['Start', 'Stop', 'Strand', 'Gene', 'Product']].sort_values('Start')

Output: 2,574

	Start	Stop	Strand	Gene	Product
4302	4710162	4710641	-	NaN	Histidine kinase
4303	4710903	4711364	+	azu	azurin
4304	4711402	4711902	-	NaN	Sigma-70 family RNA polymerase sigma factor
4305	4712134	4713315	+	NaN	Membrane fusion protein, macrolide-specific ef...
4306	4713317	4715284	+	pvdT	Pyoverdine export ATP-binding/permease protein...
4307	4715281	4716699	+	NaN	RND transporter
4308	4716706	4718316	-	NaN	PvdJ/PvdD/PvdP-like protein
4309	4718535	4719845	+	NaN	Membrane dipeptidase
4310	4719842	4720717	+	NaN	Formylglycine-generating enzyme family protein
4311	4720776	4722104	+	NaN	SidA/IucD/PvdA family monooxygenase
4312	4722123	4723772	+	NaN	Cyclic peptide transporter
4313	4724037	4726460	-	NaN	TonB-dependent siderophore receptor
4314	4726576	4727529	-	NaN	Alpha/beta hydrolase
4315	4727582	4742314	-	solG	solanimycin non-ribosomal peptide synthetase SolG
4316	4742480	4748890	-	NaN	Non-ribosomal peptide synthetase
4317	4749066	4750019	+	NaN	Siderophore interacting FAD binding protein
4318	4750161	4751552	-	argD	Diaminobutyrate-2-oxoglutarate transaminase
4319	4751767	4753089	-	baeS	histidine kinase
4320	4753086	4753769	-	ompR	Two component system DNA-binding transcription...
4321	4754140	4754703	-	NaN	NTP transferase domain-containing protein
4322	4754700	4755671	-	xdhC	XdhC family protein
4323	4755674	4756900	-	cccA	Gluconate 2-dehydrogenase (Acceptor)
4324	4756903	4757439	-	cutS	2Fe-2S ferredoxin-type domain-containing protein

Note: 136

This cluster contains: pvdT (exporter), a PvdJ/PvdD/PvdP-like protein, a SidA/IucD/PvdA-family monooxygenase, a TonB-dependent receptor (likely fpvA), two large NRPS genes (~14.7 kb and ~6.4 kb), and argD, annotated as a diaminobutyrate-2-oxoglutarate transaminase (a likely source of Dab for the chromophore).
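The quoted NRPS sizes follow directly from the Start/Stop columns. A small sketch of that arithmetic, assuming the coordinates are 1-based and inclusive (consistent with dnaA spanning 7–1545, a multiple of three); the dictionary labels are made up for this example.

```python
# Coordinates copied from the cluster table above (assumed 1-based, inclusive).
nrps_genes = {
    "solG-annotated NRPS": (4727582, 4742314),
    "second NRPS": (4742480, 4748890),
}

lengths = {}
for name, (start, stop) in nrps_genes.items():
    bp = stop - start + 1      # inclusive coordinates
    aa = bp // 3 - 1           # codons, minus the stop codon
    lengths[name] = (bp, aa)
    print(f"{name}: {bp:,} bp (~{bp / 1000:.1f} kb), {aa:,} aa")
```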

Note: 12

4. antiSMASH analysis

Code: 243 ()

# Load antiSMASH results
import json

with open('/Users/johno/Downloads/XT4K5N_1_sample_1_antismash/XT4K5N_1_sample_1_1_reference.json') as f:
    antismash = json.load(f)

# Find the NRP-metallophore region (our pyoverdine cluster)
for rec in antismash.get('records', [antismash]):
    for region in rec.get('areas', rec.get('regions', [])):
        if 'NRP-metallophore' in str(region.get('products', [])):
            print(f"Region: {region['start']:,} - {region['end']:,}")
            print(f"Products: {region['products']}")

Output: 66

Region: 4,696,705 - 4,796,912
Products: ['NRP-metallophore', 'NRPS']

Note: 64

antiSMASH identifies region 10 (4,696,705–4,796,912) as NRP-metallophore + NRPS — this is the pyoverdine BGC.
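As a quick consistency check, the manually chosen window from step 3 should fall inside the antiSMASH region. The sketch below only compares the two coordinate pairs quoted in this dialog; the helper name `contains` is hypothetical.

```python
# Coordinates quoted in this dialog.
antismash_region = (4_696_705, 4_796_912)   # antiSMASH region 10
manual_window = (4_710_000, 4_760_000)      # window used in step 3

def contains(outer, inner):
    """Return True if the inner interval lies entirely within the outer one."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

inside = contains(antismash_region, manual_window)
region_kb = (antismash_region[1] - antismash_region[0]) / 1000

print(f"Manual window inside antiSMASH region: {inside}")
print(f"antiSMASH region spans ~{region_kb:.1f} kb")
```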

Note: 88 ()

pasted_image

Note: 39 ()

NB: the SolG annotation is probably incorrect. Given the surrounding genes, it is a fair bet that this is a pyoverdine NRPS.

Note: 13

5. Extract A-domain substrate predictions

Code: 172 ()

# Parse A-domain predictions from region 10 GBK
import re

with open('/Users/johno/Downloads/XT4K5N_1_sample_1_antismash/contig_1.region010.gbk') as f:
    gbk = f.read()

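# Extract the specificity annotation that follows each AMP-binding hit
# (E-domain activity lines are captured too, so not every entry is a module)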
a_domains = re.findall(r'AMP-binding.*?/specificity="([^"]+)"', gbk, re.DOTALL)
for i, spec in enumerate(a_domains, 1):
    print(f"Module {i}: {spec}")

Output: 156

Module 1: substrate consensus: OH-Orn
Module 2: substrate consensus: X
Module 3: E activity: active
Module 4: substrate consensus: OH-Orn
Module 5: substrate consensus: Asp
Module 6: substrate consensus: Ala
Module 7: substrate consensus: Asp
Module 8: substrate consensus: Dab
Module 9: E activity: active
Module 10: substrate consensus: Tyr
Module 11: substrate consensus: Glu

Note: 295

Reading modules in biosynthetic order (11→1), the predicted peptide is:

Module	Substrate	Notes
11	Glu	Chromophore precursor
10	Tyr	Chromophore precursor
9	(E domain active)	→ D-Tyr
8	Dab	Chromophore precursor
7	Asp	Variable peptide
6	Ala
5	Asp
4	OH-Orn	Hydroxyornithine
3	(E domain active)	→ D-OHOrn
2	X	Unknown
1	OH-Orn

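Glu-Tyr-Dab is the conserved chromophore-forming tripeptide. Entries 9 and 3 are active epimerization (E) domains rather than separate modules; each is read as converting the preceding residue (Tyr and OH-Orn respectively) to its D-form.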
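The reading of the table can be written down as a short procedure: walk the printed entries in reverse (the 11→1 order used above), keep each substrate call, and let an active E domain mark the residue just added as the D-form. This is a sketch of that interpretation only; it does not re-parse the antiSMASH file.

```python
# Entries as printed by the A-domain cell above, in file order (1 -> 11).
entries = [
    "substrate consensus: OH-Orn", "substrate consensus: X", "E activity: active",
    "substrate consensus: OH-Orn", "substrate consensus: Asp", "substrate consensus: Ala",
    "substrate consensus: Asp", "substrate consensus: Dab", "E activity: active",
    "substrate consensus: Tyr", "substrate consensus: Glu",
]

peptide = []
for entry in reversed(entries):                 # biosynthetic order, 11 -> 1
    kind, _, value = entry.partition(": ")
    if kind == "substrate consensus":
        peptide.append(value)
    elif kind == "E activity" and value == "active" and peptide:
        peptide[-1] = "D-" + peptide[-1]        # epimerize the residue just added

print(" - ".join(peptide))
```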

Note: 84 ()

The Minowa method predicts the unknown module 2 as "NH2", suggesting an amino acid with a free amine (Lys, Orn, or Dab). Given the context, Lys is most likely, or so says Opus (this call is revisited later).

Note: 25

6. Determine side-chain type (pvdN vs ptaA)

Note: 276

The side-chain type is determined by which gene is present:

pvdN (PLP-dependent aminotransferase) → succinamide side chain
ptaA (periplasmic transaminase) → α-ketoglutarate side chain

BLAST the genome proteins against reference sequences:

PvdN (P. aeruginosa PAO1): NP_251095.1
PtaA (P. protegens Pf-5): AAY94407.1

Use NCBI BLAST "Align two or more sequences" with the protein FASTA: /Users/johno/Downloads/XT4K5N_results/XT4K5N_1_sample_1/annotation/XT4K5N_1_sample_1_1_reference.faa

Note: 190

BLAST results:

Query	Best hit	Identity	E-value	Conclusion
PvdN (NP_251095.1)	LDMLEE_03614	83%	6e-53	✓ pvdN present
PtaA (AAY94407.1)	LDMLEE_00269	23%	7e-09	✗ No significant hit

pvdN is present (83% identity) → strain produces succinamide side chain
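The decision rule behind this call can be made explicit. In the sketch below the hit values are copied from the table above, while the identity and E-value cut-offs are illustrative assumptions chosen for this example, not thresholds used in the original analysis.

```python
# Best hits copied from the BLAST table: (query, locus tag, % identity, E-value).
hits = [
    ("PvdN", "LDMLEE_03614", 83.0, 6e-53),
    ("PtaA", "LDMLEE_00269", 23.0, 7e-09),
]

# Illustrative cut-offs only (assumptions, not from the original analysis).
MIN_IDENTITY = 40.0
MAX_EVALUE = 1e-20

present = {}
for query, locus, identity, evalue in hits:
    present[query] = identity >= MIN_IDENTITY and evalue <= MAX_EVALUE
    status = "present" if present[query] else "no significant hit"
    print(f"{query}: {locus}, {identity:.0f}% identity, E={evalue:.0e} -> {status}")

# Side-chain call following the pvdN/ptaA logic described above.
if present["PvdN"] and not present["PtaA"]:
    side_chain = "succinamide"
elif present["PtaA"] and not present["PvdN"]:
    side_chain = "alpha-ketoglutarate"
else:
    side_chain = "undetermined"

print("Predicted side chain:", side_chain)
```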

Note: 252 ()

Predicted Structure

Predicted pyoverdine structure:

Chromophore: Glu → D-Tyr → Dab
Variable peptide: Asp → Ala → Asp → D-OHOrn → Lys? → OHOrn
Side chain: Succinamide (pvdN present at 83% identity; ptaA absent)

The BGC spans two loci: biosynthetic genes (pvdA, pvdN) at ~3.99 Mb and the main NRPS cluster at ~4.71–4.75 Mb. antiSMASH classifies the latter as an NRP-metallophore cluster. The unknown residue at position 8 is predicted (by Opus) to be Lys, based on the Minowa "NH2" annotation. LC-MS confirmation is recommended.
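Since LC-MS is the proposed tie-breaker, it helps to know how far apart the candidates for position 8 are in mass. The sketch below computes monoisotopic residue masses (amino acid minus water) for the three free-amine candidates and for Gly, which is raised later in this dialog. It covers only the mass difference contributed by this one residue, not the mass of the full pyoverdine.

```python
# Monoisotopic masses of the most abundant isotopes, in Da.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}

# Elemental composition of each candidate as a peptide residue (amino acid minus H2O).
RESIDUES = {
    "Gly": {"C": 2, "H": 3, "N": 1, "O": 1},
    "Dab": {"C": 4, "H": 8, "N": 2, "O": 1},
    "Orn": {"C": 5, "H": 10, "N": 2, "O": 1},
    "Lys": {"C": 6, "H": 12, "N": 2, "O": 1},
}

masses = {
    name: sum(MONO[element] * count for element, count in formula.items())
    for name, formula in RESIDUES.items()
}

for name, mass in masses.items():
    delta = mass - masses["Gly"]
    print(f"{name}: {mass:.4f} Da (difference from Gly: {delta:+.4f})")
```

The candidates differ by at least one CH2 unit (about 14 Da), so an accurate intact mass should be enough to tell them apart.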

Note: 6 ()

Diagram and Updates
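The drawing cell that follows colours each residue by a range of heavy-atom indices. A rough way to sanity-check such ranges without RDKit is to count heavy atoms in each SMILES fragment and accumulate the offsets. This sketch assumes atoms are numbered in the order they appear in the SMILES string, and the regular expression only handles the simple subset of SMILES used here.

```python
import re

# Per-residue SMILES fragments, in the order they are concatenated in the drawing cell.
fragments = {
    "L-Glu": "N[C@@H](CCC(=O)O)C(=O)",
    "D-Tyr": "N[C@H](Cc1ccc(O)cc1)C(=O)",
    "L-Dab": "N[C@@H](CCN)C(=O)",
    "L-Asp1": "N[C@@H](CC(=O)O)C(=O)",
    "L-Ala": "N[C@@H](C)C(=O)",
    "L-Asp2": "N[C@@H](CC(=O)O)C(=O)",
    "D-OHOrn": "N[C@H](CCCN(O))C(=O)",
    "Gly": "NCC(=O)",
    "L-OHOrn": "N[C@@H](CCCN(O))C(=O)O",
}

# One match per heavy atom: a whole bracket atom, or an organic-subset symbol.
ATOM = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")

ranges = {}
start = 0
for name, smi in fragments.items():
    n_atoms = len(ATOM.findall(smi))
    ranges[name] = range(start, start + n_atoms)
    start += n_atoms

for name, r in ranges.items():
    print(f"{name}: atoms {r.start}-{r.stop - 1} ({len(r)} atoms)")
print("Total heavy atoms:", start)
```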

Code: 2,293 ()

from rdkit import Chem
from rdkit.Chem import rdCoordGen
from rdkit.Chem.Draw import rdMolDraw2D
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from PIL import Image
import io

# 1. CORRECTED SMILES
# - Removed extra 'N' at start
# - Hydroxyornithine side chain written as 'CCCN(O)' (the same structure as 'CCCNO')
smiles_stereo = (
    "N[C@@H](CCC(=O)O)C(=O)"       # L-Glu (atoms 0-8)
    "N[C@H](Cc1ccc(O)cc1)C(=O)"    # D-Tyr (atoms 9-20)
    "N[C@@H](CCN)C(=O)"            # L-Dab (atoms 21-27)
    "N[C@@H](CC(=O)O)C(=O)"        # L-Asp (atoms 28-35)
    "N[C@@H](C)C(=O)"              # L-Ala (atoms 36-40)
    "N[C@@H](CC(=O)O)C(=O)"        # L-Asp (atoms 41-48)
    "N[C@H](CCCN(O))C(=O)"         # D-OHOrn (atoms 49-57)
    "NCC(=O)"                      # Gly (atoms 58-61)
    "N[C@@H](CCCN(O))C(=O)O"       # L-OHOrn (atoms 62-71)
)

mol = Chem.MolFromSmiles(smiles_stereo)
rdCoordGen.AddCoords(mol)

# 2. UPDATED Atom Ranges (Shifted by -1 due to removal of extra N)
residue_indices = {
    'L-Glu': range(0, 9),      # 9 atoms
    'D-Tyr': range(9, 21),     # 12 atoms
    'L-Dab': range(21, 28),    # 7 atoms
    'L-Asp₁': range(28, 36),   # 8 atoms
    'L-Ala': range(36, 41),    # 5 atoms
    'L-Asp₂': range(41, 49),   # 8 atoms
    'D-OHOrn': range(49, 58),  # 9 atoms
    'Gly': range(58, 62),      # 4 atoms
    'L-OHOrn': range(62, 72)   # 10 atoms
}

# 3. Assign colors (Same palette)
colors = [
    (0.9, 0.3, 0.3),   # Red - Glu
    (0.9, 0.6, 0.2),   # Orange - Tyr
    (0.9, 0.9, 0.2),   # Yellow - Dab
    (0.4, 0.8, 0.4),   # Green - Asp1
    (0.3, 0.7, 0.7),   # Teal - Ala
    (0.3, 0.5, 0.9),   # Blue - Asp2
    (0.5, 0.3, 0.9),   # Purple - D-OHOrn
    (0.8, 0.4, 0.8),   # Magenta - Gly
    (0.9, 0.5, 0.6),   # Pink - L-OHOrn
]

highlight_atoms = {}
highlight_colors = {}

for i, (res_name, atom_range) in enumerate(residue_indices.items()):
    for atom_idx in atom_range:
        highlight_atoms[atom_idx] = colors[i]
        highlight_colors[atom_idx] = colors[i]

# 4. Draw the molecule
drawer = rdMolDraw2D.MolDraw2DCairo(1400, 800)
drawer.drawOptions().addAtomIndices = False
drawer.drawOptions().padding = 0.05
# Optional: Make bonds thicker for better visibility
drawer.drawOptions().bondLineWidth = 2
drawer.DrawMolecule(mol, highlightAtoms=list(highlight_atoms.keys()),
                    highlightAtomColors=highlight_colors)
drawer.FinishDrawing()

# 5. Combine with Legend in Matplotlib
mol_png = drawer.GetDrawingText()
mol_img = Image.open(io.BytesIO(mol_png))

fig, (ax_mol, ax_legend) = plt.subplots(1, 2, figsize=(18, 8), 
                                         gridspec_kw={'width_ratios': [2.5, 1]})

# Display Molecule
ax_mol.imshow(mol_img)
ax_mol.axis('off')
ax_mol.set_title('Predicted Pyoverdine Precursor (Ferribactin)\nPseudomonas putida (Houseplant Root Isolate)', 
                  fontsize=16, fontweight='bold', pad=20)

# Create Legend
legend_elements = [Patch(facecolor=c, label=l) for c, l in zip(colors, [
    '1. L-Glu (becomes side chain)',
    '2. D-Tyr (chromophore core)',
    '3. L-Dab (chromophore core)',
    '4. L-Asp',
    '5. L-Ala',
    '6. L-Asp',
    '7. D-OHOrn (Fe³⁺ ligand)',
    '8. Gly (predicted)',
    '9. L-OHOrn (Fe³⁺ ligand)'
])]

ax_legend.axis('off')
ax_legend.legend(handles=legend_elements, loc='upper right', fontsize=12, 
                  title='Peptide Sequence', title_fontsize=14, frameon=True,
                  fancybox=True, shadow=True)

# Add Annotation Box
annotations = """
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STRUCTURE PREDICTION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
► Side chain: Succinamide
  (inferred from pvdN presence)

► Chromophore: Dihydroxyquinoline
  (formed by cyclization of residues 2-3)

► Fe³⁺ Binding:
  Via 2× Hydroxamate groups (residues 7,9)
  and Chromophore catechol.

► Status: NOVEL VARIANT
  Sequence distinct from known P. putida types.
  "Gly" at pos 8 inferred from "NH2" prediction.
  LC-MS/MS required for final confirmation.
"""
ax_legend.text(0.5, 0.0, annotations, transform=ax_legend.transAxes, 
               fontsize=11, verticalalignment='bottom', horizontalalignment='center',
               family='monospace', bbox=dict(boxstyle='round', facecolor='#f8f9fa', alpha=1.0))

plt.tight_layout()
plt.savefig('pyoverdine_final_structure.png', dpi=150, bbox_inches='tight')
plt.show()

Output: 824

Prompt: 97 ()

More deep research turned up the idea that the mystery X is probably Gly. Gemini made the diagram code above (after I iterated on a worse-looking version for a bit) but made an (unfounded IMO) decision to remove the extra starting N. What do you think? Any revisions to the diagram to make?

Assistant: 553

Looking at the SMILES, Gemini's decision is actually correct. The first N in N[C@@H](CCC(=O)O)C(=O) is the N-terminus of L-Glu — that's the α-amino group. Having an extra N before that would be chemically wrong (you'd have a hydrazine or similar).

There is one optional tidy-up:

The hydroxylamine groups are written as CCCN(O). That is valid SMILES and encodes the same N5-hydroxyornithine side chain as CCCNO (hydroxylamine: -NH-OH), because the parentheses only place the O on a branch. Both forms parse to the same molecule, so switching to CCCNO is a readability change rather than a chemical correction.

Here's the simpler form:


# Change these two lines:
"N[C@H](CCCN(O))C(=O)"         # D-OHOrn (atoms 50-58)
"N[C@@H](CCCN(O))C(=O)O"       # L-OHOrn (atoms 63-72)

# To:
"N[C@H](CCCNO)C(=O)"           # D-OHOrn 
"N[C@@H](CCCNO)C(=O)O"         # L-OHOrn

Also, if the mystery residue is confirmed as Gly (not Lys as speculated earlier), the current diagram and legend already show "Gly (predicted)", which is correct. You might want to update the earlier text so that it reflects the Gly interpretation of the Minowa "NH2" prediction rather than the original Lys guess.

Want me to make these changes to the code cell?

Note: 123 ()

Note from the author: I'm out of my depth enough to let the AIs agree to disagree on the pretty picture front :) I'm hoping to do some LC-MS which should let us come back and resolve things with more certainty. For now, I'd say take the exact structure with a grain of salt BUT it looks like we do definitely have a novel pyoverdine on our hands.

johnowhitaker

Code: 13 ()

# !uv pip install py3Dmol

Code: 400 ()

import py3Dmol
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = "N[C@@H](CCC(=O)O)C(=O)N[C@H](Cc1ccc(O)cc1)C(=O)N[C@@H](CCN)C(=O)N[C@@H](CC(=O)O)C(=O)N[C@@H](C)C(=O)N[C@@H](CC(=O)O)C(=O)N[C@H](CCCNO)C(=O)NCC(=O)N[C@@H](CCCNO)C(=O)O"

mol = Chem.MolFromSmiles(smiles)
mol = Chem.AddHs(mol)
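# EmbedMolecule returns -1 if no 3D conformer could be generated
if AllChem.EmbedMolecule(mol, randomSeed=42) == -1:
    raise RuntimeError("3D embedding failed")
AllChem.MMFFOptimizeMolecule(mol)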

mb = Chem.MolToMolBlock(mol)

view = py3Dmol.view(width=400, height=300)
view.addModel(mb, 'mol')
view.setStyle({'stick': {'colorscheme': 'greenCarbon', 'radius': 0.15}})
view.setBackgroundColor('white')
view.zoomTo()
view.show()

Output: 14,464
