import pandas as pd

# Read in the annotation
annot = pd.read_csv('/Users/johno/Downloads/XT4K5N_results/XT4K5N_1_sample_1/annotation/XT4K5N_1_sample_1_1_reference.tsv', sep='\t', comment='#', header=None, names=['Sequence Id', 'Type', 'Start', 'Stop', 'Strand', 'Locus Tag', 'Gene', 'Product', 'DbXrefs'])
print(f"Total features annotated: {len(annot)}")
annot.head()
Total features annotated: 5952
| | Sequence Id | Type | Start | Stop | Strand | Locus Tag | Gene | Product | DbXrefs |
|---|---|---|---|---|---|---|---|---|---|
| 0 | contig_1 | cds | 7 | 1545 | + | LDMLEE_00001 | dnaA | chromosomal replication initiator protein DnaA | GO:0003677, GO:0003688, GO:0005524, GO:0005737... |
| 1 | contig_1 | cds | 1586 | 2689 | + | LDMLEE_00002 | dnaN | DNA polymerase III subunit beta | EC:2.7.7.7, GO:0003677, GO:0003887, GO:0006260... |
| 2 | contig_1 | cds | 2705 | 3808 | + | LDMLEE_00003 | recF | DNA replication/repair protein RecF | COG:COG1195, COG:L, GO:0000731, GO:0003697, GO... |
| 3 | contig_1 | cds | 3813 | 6233 | + | LDMLEE_00004 | gyrB | DNA topoisomerase (ATP-hydrolyzing) subunit B | EC:5.6.2.2, GO:0003677, GO:0003918, GO:0005524... |
| 4 | contig_1 | cds | 6557 | 7393 | + | LDMLEE_00005 | dinD | DNA damage-inducible protein D | KEGG:K14623, SO:0001217, UniRef:UniRef50_N8WDK... |
# Search for pyoverdine-related genes
pvd_mask = annot['Gene'].str.contains('pvd|pta', case=False, na=False)
nrps_mask = annot['Product'].str.contains('NRPS|siderophore|pyoverdine|non-ribosomal', case=False, na=False)
pvd_genes = annot[pvd_mask | nrps_mask][['Start', 'Stop', 'Strand', 'Gene', 'Product']]
print(f"Found {len(pvd_genes)} potential pyoverdine-related genes")
pvd_genes.sort_values('Start')
Found 33 potential pyoverdine-related genes
| | Start | Stop | Strand | Gene | Product |
|---|---|---|---|---|---|
| 103 | 104161 | 105819 | - | nptA | PhoU domain-containing protein |
| 262 | 302870 | 304945 | - | NaN | TonB-dependent siderophore receptor |
| 268 | 312380 | 314677 | + | fhuE | TonB-dependent siderophore receptor |
| 272 | 316413 | 318929 | - | NaN | TonB-dependent siderophore receptor |
| 361 | 417833 | 420265 | - | NaN | TonB-dependent siderophore receptor |
| 562 | 611855 | 614032 | + | fhuE | Ferripyoverdine receptor |
| 689 | 741595 | 744024 | - | NaN | TonB-dependent siderophore receptor |
| 749 | 808069 | 810498 | + | NaN | TonB-dependent siderophore receptor |
| 775 | 834621 | 835178 | + | hptA | hypoxanthine-guanine phosphoribosyltransferase |
| 801 | 860991 | 863078 | - | pta | phosphate acetyltransferase |
| 850 | 948335 | 950860 | + | NaN | TonB-dependent siderophore receptor |
| 892 | 989285 | 991576 | - | fiu | TonB-dependent siderophore receptor |
| 898 | 996667 | 999096 | + | fecA | TonB-dependent siderophore receptor |
| 950 | 1054211 | 1054735 | - | lptA | lipopolysaccharide transport periplasmic prote... |
| 1708 | 1871826 | 1874303 | + | NaN | TonB-dependent siderophore receptor |
| 1727 | 1893842 | 1895953 | - | NaN | TonB-dependent siderophore receptor |
| 1746 | 1911335 | 1913770 | + | cirA | TonB-dependent siderophore receptor |
| 2009 | 2213982 | 2216111 | - | fhuE | Outer membrane ferric siderophore receptor |
| 2060 | 2269455 | 2271884 | + | NaN | TonB-dependent siderophore receptor |
| 2104 | 2340708 | 2343158 | - | NaN | Ferripyoverdine receptor |
| 2168 | 2420925 | 2423168 | - | fiu | Outer membrane ferric siderophore receptor |
| 2169 | 2423286 | 2424275 | - | ceuA | Siderophore ABC transporter substrate-binding ... |
| 2170 | 2424272 | 2425048 | - | ceuD | Ferric siderophore ABC transporter, ATP-bindin... |
| 2171 | 2425045 | 2425986 | - | ceuC | Ferric siderophore ABC transporter, permease p... |
| 2252 | 2523109 | 2525544 | - | NaN | Outer membrane ferripyoverdine receptor |
| 3384 | 3715001 | 3717406 | - | fiu | TonB-dependent siderophore receptor |
| 3631 | 3994194 | 3995528 | + | pvdA | L-ornithine N5-oxygenase PvdA |
| 4306 | 4713317 | 4715284 | + | pvdT | Pyoverdine export ATP-binding/permease protein... |
| 4313 | 4724037 | 4726460 | - | NaN | TonB-dependent siderophore receptor |
| 4315 | 4727582 | 4742314 | - | solG | solanimycin non-ribosomal peptide synthetase SolG |
| 4316 | 4742480 | 4748890 | - | NaN | Non-ribosomal peptide synthetase |
| 4317 | 4749066 | 4750019 | + | NaN | Siderophore interacting FAD binding protein |
| 5711 | 6278694 | 6281057 | + | pvdQ | Acyl-homoserine lactone acylase QuiP |
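Some rows in the table above (nptA, hptA, lptA) appear only because the unanchored `pvd|pta` pattern matches `pta` anywhere inside a gene name. Here is a minimal sketch of a stricter filter, run on gene names copied from the output above. The anchored pattern is an assumption about naming (gene names that begin with `pvd` or `pta`), not a rule taken from the annotation tool:

```python
import re

# Gene names copied from the search output above (NaN rows omitted).
genes = ["nptA", "fhuE", "hptA", "pta", "fiu", "fecA", "lptA",
         "cirA", "ceuA", "ceuD", "ceuC", "pvdA", "pvdT", "solG", "pvdQ"]

loose = re.compile(r"pvd|pta", re.IGNORECASE)         # pattern used in the cell above
strict = re.compile(r"^(?:pvd|pta)", re.IGNORECASE)   # assumption: match at the start only

loose_hits = [g for g in genes if loose.search(g)]
strict_hits = [g for g in genes if strict.search(g)]
dropped = [g for g in loose_hits if g not in strict_hits]

print("loose :", loose_hits)
print("strict:", strict_hits)
print("dropped by anchoring:", dropped)
```

The same pattern string can be passed to `annot['Gene'].str.contains(...)`. Note that `pta` (phosphate acetyltransferase) still passes the anchored filter, so the Product column remains the final check.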
cluster_region = annot[(annot['Start'] >= 4710000) & (annot['Stop'] <= 4760000)]
cluster_region[['Start', 'Stop', 'Strand', 'Gene', 'Product']].sort_values('Start')
| Start | Stop | Strand | Gene | Product | |
|---|---|---|---|---|---|
| 4302 | 4710162 | 4710641 | - | NaN | Histidine kinase |
| 4303 | 4710903 | 4711364 | + | azu | azurin |
| 4304 | 4711402 | 4711902 | - | NaN | Sigma-70 family RNA polymerase sigma factor |
| 4305 | 4712134 | 4713315 | + | NaN | Membrane fusion protein, macrolide-specific ef... |
| 4306 | 4713317 | 4715284 | + | pvdT | Pyoverdine export ATP-binding/permease protein... |
| 4307 | 4715281 | 4716699 | + | NaN | RND transporter |
| 4308 | 4716706 | 4718316 | - | NaN | PvdJ/PvdD/PvdP-like protein |
| 4309 | 4718535 | 4719845 | + | NaN | Membrane dipeptidase |
| 4310 | 4719842 | 4720717 | + | NaN | Formylglycine-generating enzyme family protein |
| 4311 | 4720776 | 4722104 | + | NaN | SidA/IucD/PvdA family monooxygenase |
| 4312 | 4722123 | 4723772 | + | NaN | Cyclic peptide transporter |
| 4313 | 4724037 | 4726460 | - | NaN | TonB-dependent siderophore receptor |
| 4314 | 4726576 | 4727529 | - | NaN | Alpha/beta hydrolase |
| 4315 | 4727582 | 4742314 | - | solG | solanimycin non-ribosomal peptide synthetase SolG |
| 4316 | 4742480 | 4748890 | - | NaN | Non-ribosomal peptide synthetase |
| 4317 | 4749066 | 4750019 | + | NaN | Siderophore interacting FAD binding protein |
| 4318 | 4750161 | 4751552 | - | argD | Diaminobutyrate-2-oxoglutarate transaminase |
| 4319 | 4751767 | 4753089 | - | baeS | histidine kinase |
| 4320 | 4753086 | 4753769 | - | ompR | Two component system DNA-binding transcription... |
| 4321 | 4754140 | 4754703 | - | NaN | NTP transferase domain-containing protein |
| 4322 | 4754700 | 4755671 | - | xdhC | XdhC family protein |
| 4323 | 4755674 | 4756900 | - | cccA | Gluconate 2-dehydrogenase (Acceptor) |
| 4324 | 4756903 | 4757439 | - | cutS | 2Fe-2S ferredoxin-type domain-containing protein |
# Load antiSMASH results
import json
with open('/Users/johno/Downloads/XT4K5N_1_sample_1_antismash/XT4K5N_1_sample_1_1_reference.json') as f:
    antismash = json.load(f)
# Find the NRP-metallophore region (our pyoverdine cluster)
for rec in antismash.get('records', [antismash]):
    for region in rec.get('areas', rec.get('regions', [])):
        if 'NRP-metallophore' in str(region.get('products', [])):
            print(f"Region: {region['start']:,} - {region['end']:,}")
            print(f"Products: {region['products']}")
# Parse A-domain predictions from region 10 GBK
import re
with open('/Users/johno/Downloads/XT4K5N_1_sample_1_antismash/contig_1.region010.gbk') as f:
    gbk = f.read()
# Extract A-domain substrate predictions
a_domains = re.findall(r'AMP-binding.*?/specificity="([^"]+)"', gbk, re.DOTALL)
for i, spec in enumerate(a_domains, 1):
    print(f"Module {i}: {spec}")
Module 1: substrate consensus: OH-Orn
Module 2: substrate consensus: X
Module 3: E activity: active
Module 4: substrate consensus: OH-Orn
Module 5: substrate consensus: Asp
Module 6: substrate consensus: Ala
Module 7: substrate consensus: Asp
Module 8: substrate consensus: Dab
Module 9: E activity: active
Module 10: substrate consensus: Tyr
Module 11: substrate consensus: Glu
Reading modules in biosynthetic order (11→1), the predicted peptide is:
| Module | Substrate | Notes |
|---|---|---|
| 11 | Glu | Chromophore precursor |
| 10 | Tyr | Chromophore precursor |
| 9 | (E domain active) | → D-Tyr |
| 8 | Dab | Chromophore precursor |
| 7 | Asp | Variable peptide |
| 6 | Ala | |
| 5 | Asp | |
| 4 | OH-Orn | Hydroxyornithine |
| 3 | (E domain active) | → D-OHOrn |
| 2 | X | Unknown |
| 1 | OH-Orn | |
The Glu-Tyr-Dab tripeptide is the conserved chromophore-forming unit.
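The reversal and the handling of the "E activity" entries can be written out explicitly. This is a minimal sketch that uses the specificity strings copied from the output above. It assumes, following the notes in the table, that each active E-domain entry epimerizes the residue immediately before it in biosynthetic order:

```python
# Specificity strings copied from the cell output above, in file order (Module 1..11).
specs = [
    "substrate consensus: OH-Orn",
    "substrate consensus: X",
    "E activity: active",
    "substrate consensus: OH-Orn",
    "substrate consensus: Asp",
    "substrate consensus: Ala",
    "substrate consensus: Asp",
    "substrate consensus: Dab",
    "E activity: active",
    "substrate consensus: Tyr",
    "substrate consensus: Glu",
]

def predicted_peptide(specs):
    """Walk the entries in reverse (biosynthetic) order and fold each
    active E-domain entry into the residue that precedes it."""
    peptide = []
    for entry in reversed(specs):
        kind, _, value = entry.partition(": ")
        if kind == "substrate consensus":
            peptide.append(value)
        elif kind == "E activity" and value == "active" and peptide:
            peptide[-1] = "D-" + peptide[-1]  # assumed: E domain acts on the prior residue
    return peptide

peptide = predicted_peptide(specs)
print(" → ".join(peptide))
```

Separating substrate entries from E-domain entries also shows that the 11 regex matches correspond to 9 residues, not 11 modules.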
The side-chain type is determined by which gene is present:
- pvdN (PLP-dependent aminotransferase) → succinamide side chain
- ptaA (glutamate dehydrogenase) → α-ketoglutarate side chain
BLAST the genome proteins against reference sequences:
- PvdN (P. aeruginosa PAO1): NP_251095.1
- PtaA (P. protegens Pf-5): AAY94407.1
Use NCBI BLAST "Align two or more sequences" with the protein FASTA:
/Users/johno/Downloads/XT4K5N_results/XT4K5N_1_sample_1/annotation/XT4K5N_1_sample_1_1_reference.faa
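Once the BLAST results are in hand, the side-chain rule can be expressed as a small function. This is only a sketch: the function name `side_chain` and the 50% identity cutoff are assumptions made for illustration, and the input is simply the best percent identity found for each reference protein (or `None` when there is no hit).

```python
def side_chain(best_identity, min_identity=50.0):
    """best_identity maps a reference name to the best % identity found in the
    genome, or None when BLAST returned no hit.
    min_identity is an assumed, arbitrary cutoff for calling a gene 'present'."""
    def present(name):
        pid = best_identity.get(name)
        return pid is not None and pid >= min_identity

    has_pvdn, has_ptaa = present("PvdN"), present("PtaA")
    if has_pvdn and not has_ptaa:
        return "succinamide"
    if has_ptaa and not has_pvdn:
        return "alpha-ketoglutarate"
    return "ambiguous"  # both or neither found: the rule gives no answer

# Values reported in this write-up: a PvdN hit at 83% identity and no PtaA hit.
print(side_chain({"PvdN": 83.0, "PtaA": None}))
```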
Predicted Structure
Predicted pyoverdine structure:
- Chromophore: Glu → D-Tyr → Dab
- Variable peptide: Asp → Ala → Asp → D-OHOrn → Lys? → OHOrn
- Side chain: Succinamide (pvdN present at 83% identity; ptaA absent)
The BGC spans two loci: biosynthetic genes (pvdA, pvdN) at ~3.99 Mb and the main NRPS cluster at ~4.71-4.75 Mb. antiSMASH classifies this as an NRP-metallophore cluster. The unknown residue at position 8 is predicted (by Opus) to be Lys based on Minowa "NH2" annotation — LC-MS confirmation recommended.
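To make the LC-MS comparison concrete, here is a rough sketch that computes monoisotopic masses for the linear, unmodified peptide with different candidates at the unknown position. It uses standard residue compositions and element masses. The helper `linear_peptide_mass` is illustrative, and the absolute values describe only the hypothetical linear precursor, not the mature pyoverdine (chromophore formation and side-chain changes are ignored). The useful quantity is the difference between candidates.

```python
# Monoisotopic element masses in Da.
MASS = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}

# Residue compositions (free amino acid minus H2O) as (C, H, N, O).
RESIDUE = {
    "Glu":   (5, 7, 1, 3),
    "Tyr":   (9, 9, 1, 2),
    "Dab":   (4, 8, 2, 1),
    "Asp":   (4, 5, 1, 3),
    "Ala":   (3, 5, 1, 1),
    "OHOrn": (5, 10, 2, 2),
    "Gly":   (2, 3, 1, 1),
    "Lys":   (6, 12, 2, 1),
}

def linear_peptide_mass(residues):
    """Monoisotopic mass of a linear peptide: sum of residues plus one H2O.
    D/L configuration does not change the mass, so it is ignored here."""
    total = 2 * MASS["H"] + MASS["O"]
    for name in residues:
        c, h, n, o = RESIDUE[name]
        total += c * MASS["C"] + h * MASS["H"] + n * MASS["N"] + o * MASS["O"]
    return total

def candidate(position_8):
    return ["Glu", "Tyr", "Dab", "Asp", "Ala", "Asp", "OHOrn", position_8, "OHOrn"]

m_lys = linear_peptide_mass(candidate("Lys"))
m_gly = linear_peptide_mass(candidate("Gly"))
print(f"Lys variant: {m_lys:.4f} Da")
print(f"Gly variant: {m_gly:.4f} Da")
print(f"Difference:  {m_lys - m_gly:.4f} Da")
```

The difference equals the mass of C4H9N, the extra atoms in a lysine side chain relative to glycine, and should carry over to the mature molecule if the rest of the structure is the same.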
from rdkit import Chem
from rdkit.Chem import rdCoordGen
from rdkit.Chem.Draw import rdMolDraw2D
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from PIL import Image
import io
# 1. CORRECTED SMILES
# - Removed extra 'N' at start
# - Changed 'CCCNO' to 'CCCN(O)' for proper Hydroxyornithine structure
smiles_stereo = (
"N[C@@H](CCC(=O)O)C(=O)" # L-Glu (atoms 0-8)
"N[C@H](Cc1ccc(O)cc1)C(=O)" # D-Tyr (atoms 9-20)
"N[C@@H](CCN)C(=O)" # L-Dab (atoms 21-27)
"N[C@@H](CC(=O)O)C(=O)" # L-Asp (atoms 28-35)
"N[C@@H](C)C(=O)" # L-Ala (atoms 36-40)
"N[C@@H](CC(=O)O)C(=O)" # L-Asp (atoms 41-48)
"N[C@H](CCCN(O))C(=O)" # D-OHOrn (atoms 49-57)
"NCC(=O)" # Gly (atoms 58-61)
"N[C@@H](CCCN(O))C(=O)O" # L-OHOrn (atoms 62-71)
)
mol = Chem.MolFromSmiles(smiles_stereo)
rdCoordGen.AddCoords(mol)
# 2. UPDATED Atom Ranges (Shifted by -1 due to removal of extra N)
residue_indices = {
'L-Glu': range(0, 9), # 9 atoms
'D-Tyr': range(9, 21), # 12 atoms
'L-Dab': range(21, 28), # 7 atoms
'L-Asp₁': range(28, 36), # 8 atoms
'L-Ala': range(36, 41), # 5 atoms
'L-Asp₂': range(41, 49), # 8 atoms
'D-OHOrn': range(49, 58), # 9 atoms
'Gly': range(58, 62), # 4 atoms
'L-OHOrn': range(62, 72) # 10 atoms
}
# 3. Assign colors (Same palette)
colors = [
(0.9, 0.3, 0.3), # Red - Glu
(0.9, 0.6, 0.2), # Orange - Tyr
(0.9, 0.9, 0.2), # Yellow - Dab
(0.4, 0.8, 0.4), # Green - Asp1
(0.3, 0.7, 0.7), # Teal - Ala
(0.3, 0.5, 0.9), # Blue - Asp2
(0.5, 0.3, 0.9), # Purple - D-OHOrn
(0.8, 0.4, 0.8), # Magenta - Gly
(0.9, 0.5, 0.6), # Pink - L-OHOrn
]
highlight_atoms = {}
highlight_colors = {}
for i, (res_name, atom_range) in enumerate(residue_indices.items()):
    for atom_idx in atom_range:
        highlight_atoms[atom_idx] = colors[i]
        highlight_colors[atom_idx] = colors[i]
# 4. Draw the molecule
drawer = rdMolDraw2D.MolDraw2DCairo(1400, 800)
drawer.drawOptions().addAtomIndices = False
drawer.drawOptions().padding = 0.05
# Optional: Make bonds thicker for better visibility
drawer.drawOptions().bondLineWidth = 2
drawer.DrawMolecule(mol, highlightAtoms=list(highlight_atoms.keys()),
highlightAtomColors=highlight_colors)
drawer.FinishDrawing()
# 5. Combine with Legend in Matplotlib
mol_png = drawer.GetDrawingText()
mol_img = Image.open(io.BytesIO(mol_png))
fig, (ax_mol, ax_legend) = plt.subplots(1, 2, figsize=(18, 8),
gridspec_kw={'width_ratios': [2.5, 1]})
# Display Molecule
ax_mol.imshow(mol_img)
ax_mol.axis('off')
ax_mol.set_title('Predicted Pyoverdine Precursor (Ferribactin)\nPseudomonas putida (Houseplant Root Isolate)',
fontsize=16, fontweight='bold', pad=20)
# Create Legend
legend_elements = [Patch(facecolor=c, label=l) for c, l in zip(colors, [
'1. L-Glu (becomes side chain)',
'2. D-Tyr (chromophore core)',
'3. L-Dab (chromophore core)',
'4. L-Asp',
'5. L-Ala',
'6. L-Asp',
'7. D-OHOrn (Fe³⁺ ligand)',
'8. Gly (predicted)',
'9. L-OHOrn (Fe³⁺ ligand)'
])]
ax_legend.axis('off')
ax_legend.legend(handles=legend_elements, loc='upper right', fontsize=12,
title='Peptide Sequence', title_fontsize=14, frameon=True,
fancybox=True, shadow=True)
# Add Annotation Box
annotations = """
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STRUCTURE PREDICTION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
► Side chain: Succinamide
(inferred from pvdN presence)
► Chromophore: Dihydroxyquinoline
(formed from residues 1-3 cyclization)
► Fe³⁺ Binding:
Via 2× Hydroxamate groups (residues 7,9)
and Chromophore catechol.
► Status: NOVEL VARIANT
Sequence distinct from known P. putida types.
"Gly" at pos 8 inferred from "NH2" prediction.
LC-MS/MS required for final confirmation.
"""
ax_legend.text(0.5, 0.0, annotations, transform=ax_legend.transAxes,
fontsize=11, verticalalignment='bottom', horizontalalignment='center',
family='monospace', bbox=dict(boxstyle='round', facecolor='#f8f9fa', alpha=1.0))
plt.tight_layout()
plt.savefig('pyoverdine_final_structure.png', dpi=150, bbox_inches='tight')
plt.show()
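The hard-coded atom ranges in `residue_indices` are easy to get wrong by one, so it may help to derive them from the SMILES fragments instead. The sketch below counts heavy atoms per fragment with a small regular expression. It assumes that atom indices follow the order in which atoms appear in the SMILES string (which is how RDKit numbers atoms when parsing SMILES) and that no fragment contains explicit `[H]` atoms. The helper `atom_ranges` is illustrative.

```python
import re

# Residue fragments copied from smiles_stereo above.
fragments = [
    ("L-Glu",   "N[C@@H](CCC(=O)O)C(=O)"),
    ("D-Tyr",   "N[C@H](Cc1ccc(O)cc1)C(=O)"),
    ("L-Dab",   "N[C@@H](CCN)C(=O)"),
    ("L-Asp1",  "N[C@@H](CC(=O)O)C(=O)"),
    ("L-Ala",   "N[C@@H](C)C(=O)"),
    ("L-Asp2",  "N[C@@H](CC(=O)O)C(=O)"),
    ("D-OHOrn", "N[C@H](CCCN(O))C(=O)"),
    ("Gly",     "NCC(=O)"),
    ("L-OHOrn", "N[C@@H](CCCN(O))C(=O)O"),
]

# One heavy atom: a bracket atom, or an organic-subset symbol (aliphatic or aromatic).
ATOM = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")

def atom_ranges(fragments):
    """Return {residue name: range of atom indices}, assuming indices
    follow SMILES string order."""
    ranges, start = {}, 0
    for name, smi in fragments:
        n = len(ATOM.findall(smi))
        ranges[name] = range(start, start + n)
        start += n
    return ranges

derived = atom_ranges(fragments)
for name, r in derived.items():
    print(f"{name:8s} atoms {r.start}-{r.stop - 1} ({len(r)} atoms)")
```

A result that matches the ranges typed into `residue_indices` is a quick check that the highlighting colors land on the intended residues.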
More deep research turned up the idea that the mystery X is probably Gly. Gemini made the diagram code above (after I iterated on a worse-looking version for a bit) but made an (unfounded IMO) decision to remove the extra starting N. What do you think? Any revisions to the diagram to make?
Looking at the SMILES, Gemini's decision is actually correct. The first N in N[C@@H](CCC(=O)O)C(=O) is the N-terminus of L-Glu — that's the α-amino group. Having an extra N before that would be chemically wrong (you'd have a hydrazine or similar).
However, I do see one issue worth fixing:
The hydroxylamine groups are written as CCCN(O) which is incomplete SMILES (the nitrogen has an open valence). For N5-hydroxyornithine, it should be CCCNO (hydroxylamine: -NH-OH). RDKit likely added implicit hydrogens, so it renders okay, but it's not strictly correct.
Here's the fix:
# Change these two lines:
"N[C@H](CCCN(O))C(=O)" # D-OHOrn (atoms 50-58)
"N[C@@H](CCCN(O))C(=O)O" # L-OHOrn (atoms 63-72)
# To:
"N[C@H](CCCNO)C(=O)" # D-OHOrn
"N[C@@H](CCCNO)C(=O)O" # L-OHOrn
Also, if the mystery residue is confirmed as Gly (not Lys as speculated earlier), the current diagram and legend already show "Gly (predicted)" which is correct. You might want to update the annotation text to reflect that the "NH2" Minowa prediction → Gly interpretation, rather than the earlier Lys guess.
Want me to make these changes to the code cell?
Note from the author: I'm out of my depth enough to let the AIs agree to disagree on the pretty picture front :) I'm hoping to do some LC-MS which should let us come back and resolve things with more certainty. For now, I'd say take the exact structure with a grain of salt BUT it looks like we do definitely have a novel pyoverdine on our hands.
- johnowhitaker
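The `CCCN(O)` versus `CCCNO` question can be checked mechanically. Under the standard SMILES branch rules, parentheses open a branch and the chain resumes from the atom before the branch, so a branch with nothing after it should give the same connectivity as simply continuing the chain. Below is a minimal, stdlib-only sketch of that rule. The tiny parser `bonds` is illustrative, handles only ring-free C/N/O fragments, and ignores chirality. A canonical SMILES comparison in RDKit would be the more authoritative check.

```python
import re

# Tokens supported by this sketch: bracket atoms, C/N/O atoms, branches, double bonds.
TOKEN = re.compile(r"\[[^\]]+\]|[CNO]|[()=]")

def bonds(smiles):
    """Return the set of (atom_i, atom_j, order) bonds for a ring-free SMILES."""
    if TOKEN.sub("", smiles):
        raise ValueError("unsupported SMILES syntax for this sketch")
    edges, stack = set(), []
    prev, order, n_atoms = None, 1, 0
    for tok in TOKEN.findall(smiles):
        if tok == "(":
            stack.append(prev)       # remember the branch point
        elif tok == ")":
            prev = stack.pop()       # resume from the branch point
        elif tok == "=":
            order = 2                # next bond is a double bond
        else:                        # an atom token
            if prev is not None:
                edges.add((prev, n_atoms, order))
            prev, order, n_atoms = n_atoms, 1, n_atoms + 1
    return edges

with_parens = "N[C@H](CCCN(O))C(=O)"   # spelling used in the diagram cell
without = "N[C@H](CCCNO)C(=O)"         # spelling proposed in the reply
print(bonds(with_parens) == bonds(without))
```

If the two bond sets are equal, the two spellings describe the same N-hydroxy group and the choice between them is cosmetic.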
import py3Dmol
from rdkit import Chem
from rdkit.Chem import AllChem
smiles = "N[C@@H](CCC(=O)O)C(=O)N[C@H](Cc1ccc(O)cc1)C(=O)N[C@@H](CCN)C(=O)N[C@@H](CC(=O)O)C(=O)N[C@@H](C)C(=O)N[C@@H](CC(=O)O)C(=O)N[C@H](CCCNO)C(=O)NCC(=O)N[C@@H](CCCNO)C(=O)O"
mol = Chem.MolFromSmiles(smiles)
mol = Chem.AddHs(mol)
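# EmbedMolecule returns -1 if no 3D conformer could be generated
if AllChem.EmbedMolecule(mol, randomSeed=42) == -1:
    raise RuntimeError("3D embedding failed for the predicted peptide")
AllChem.MMFFOptimizeMolecule(mol)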
mb = Chem.MolToMolBlock(mol)
view = py3Dmol.view(width=400, height=300)
view.addModel(mb, 'mol')
view.setStyle({'stick': {'colorscheme': 'greenCarbon', 'radius': 0.15}})
view.setBackgroundColor('white')
view.zoomTo()
view.show()