Impact of international travel and diarrhea on gut microbiome and resistome dynamics

Study design and cohort overview

Our cohort of international travelers was comprised of Spanish language students who were enrolled at the Amauta Spanish School in Cusco, Peru. A clinic staffed by a study physician was established inside the school to enroll and follow subjects. Travelers aged ≥18 years, able to understand written English, and enrolled at the school as students were eligible. The majority of these travelers stayed in lodging provided by the school within its premises. The study physician explained the purpose and procedures of the study to potential participants within 24 to 48 h of their arrival in Cusco city. Those who agreed to participate signed an informed consent form and completed a standardized enrollment questionnaire, that collected baseline demographic and epidemiologic data, as well as medical and travel history (Form A in Supplemental Information). Subjects were also provided a stool collection kit and instructions to collect a baseline stool sample with their next bowel movement. Weekly stool samples were collected from subjects throughout their stay, and a weekly questionnaire collected data on unreported diarrheal episodes and dietary habits. Subjects were instructed to immediately communicate with the study physician when having a diarrheal episode and before taking any medication (e.g., antibiotics). Diarrhea was defined as passing one or more semi-liquid or watery bowel movements associated with gastrointestinal symptoms such as abdominal pain, cramping and/or nausea. Subjects reporting diarrhea were asked to provide an extra stool sample and were examined daily. During the diarrheal illness, 70 participants were prescribed antibiotics (e.g., ciprofloxacin, azithromycin) supplemented with oral rehydration therapy, and the remaining individuals were advised only oral rehydration therapy. Few individuals (n = 8) took antispasmodic and antiparasitic medication. Physical exam findings, symptoms, duration, and medications were documented in an acute diarrhea questionnaire (Form B in Supplemental Information). Diarrheal episodes were treated as independent events if an asymptomatic period of at least seven days occurred between them. Samples were classified as non-diarrhea or diarrhea based on whether they were collected during a diarrheal episode or not. Samples were also given a stool grade based on the Bristol stool form scale (BSS) of stool consistency, ranging from grade 1 (normal, hard) to grade 4 (loose, diarrheal).

The study protocol and its amendments were reviewed and approved by the Institutional Ethics Committee of the Universidad Peruana Cayateno Heredia, the Institutional Review Board of the US Naval Medical Research Unit No. 6 (NAMRU-6), and the Washington University in St. Louis Institutional Review Board.

Empirical selection of diarrheal samples for analysis

Since diarrhea is a complex clinical infection and reliance on self-reported information alone can be misleading, we implemented a multi-factor selection procedure using epidemiological and clinical information to screen diarrheal samples prior to downstream analyses. First, we verified that the diarrheal samples were collected within the documented onset and recovery dates. If the recovery date was missing from the metadata, only diarrheal samples collected within two days of onset were included; if both onset and recovery dates were missing, then only samples that had a stool consistency of 3 or 4 (semi-liquid to liquid) were included. We then implemented a modified scoring scheme^73,74 referred here as the TD score, to further minimize the false positives (Supplementary Table 14). The TD score combined the presence of clinical signs (such as fever, vomiting, dehydration e.t.c.) and macroscopic data (stool consistency) collected during the diarrheal episode into a single metric. These parameters included: (1) stool frequency (in the last 24 h) and consistency; (2) duration of diarrhea; (3) participant-reported dehydration; (4) behavioral signs (e.g., nausea, fatigue, headache, bloating, anorexia) and/or clinical symptoms (e.g., temperature >38 °C); (5) pulse rate >100; and (6) presence of blood in stool. Each parameter was further divided into thirds and assigned equal points based on the severity of the symptom (1 point: bottom-third, 2 points: middle-third, and 3 points: upper-third) and were aggregated across parameters. We analyzed the TD score validity by comparing it with individual factors such as stool grade, maximum stool frequency in 24 h, and duration of diarrhea. While the individual factors did not necessarily correlate with each other, the TD score was significantly correlated with the given factors (Supplementary Discussion). Diarrheal samples with a TD score greater than 1.5 were included in the analysis. Using this multi-faceted filtering strategy, we selected 102 diarrheal samples total for downstream analyses.

Stool specimen processing

All collected stool samples were delivered to the onsite clinic. Samples were refrigerated at 4 °C and preserved in Cary Blair media then transported to the Universidad Peruana Cayetano Heredia laboratory in Cusco, Peru within 2 h. Stool samples from diarrheal episodes that started during the night hours were collected by study personnel and refrigerated at 4 °C until they could be delivered the next morning. In the laboratory, stool samples were examined macroscopically for consistency and appearance, microscopically for parasite detection and fecal leukocytes, and underwent routine stool culture followed by antimicrobial sensitivity testing. Aliquots of fresh stool with no preservatives were frozen at −80 °C. Primary isolates and fresh stool aliquots were shipped in dry ice on a monthly basis to NAMRU-6 in Lima, Peru for further workup and long-term storage. Samples stored at −80 °C were shipped to Washington University in St. Louis, MO where the samples were stored at −80 °C until DNA extraction.

Stool assessment and cultures for enteropathogen detection and antibiotic susceptibility testing of isolates

Stool samples were collected and processed immediately upon arrival at the nearby laboratory in Cusco. Gross examination and Hemoccult card tests (Beckman Coulter, Brea, CA) were used to evaluate for fecal gross and occult blood, respectively. Stool microscopy using direct iodine wet mount, modified Kinyoun stain, and methylene blue stain techniques were performed to examine for the presence of parasite ova, cyst, and fecal leukocytes, respectively as previously described⁷⁵. Bacterial cultures were performed for the presence of pathogenic E. coli, Salmonella spp., Shigella spp., Campylobacter spp., Aeromonas spp., Plesiomonas spp., and Vibrio spp. All primary, enrichment, and biochemical differentiation culture media, BBL Sensi-Discs susceptibility discs, and catalase and oxidase test reagents were purchased from Difco BD Bioscience (Franklin Lakes, NJ). Fresh stool samples and stools preserved in Cary Blair transport media were cultured for bacterial enteropathogens using conventional microbiologic techniques^76,77. Enteropathogens were identified according to their growth on differential and selective agar plates⁷⁸. Stool specimens were streaked onto MacConkey agar, Salmonella–Shigella agar (SS), Hektoen Enteric agar (HE), thiosulfate citrate bile salt sucrose agar (TCBS), and Campylobacter blood-free agar base (CBF). All agar culture plates were incubated at 37 °C for 24 h after inoculation with the exemption of Campylobacter blood-free agar base plates that were incubated at 42 °C for 48 h in microaerophilic conditions. Stool samples were also inoculated into selenite enrichment broth and peptone water as sub-cultures and incubated at 37 °C for 18–20 h after inoculation. After incubation, the selenite enrichment broth was streaked onto HE agar to assess for Shigella spp. and the peptone water was streaked in the TCBS agar to screen for Vibrio spp. and both were incubated at 37 °C for 24 h. Five lactose fermenting colonies with morphology compatible with E. coli were selected from each MacConkey agar plate and preserved for polymerase chain reaction detection of pathogenic E. coli as described below. Other colonies with morphology and characteristics of enteropathogens seen on the MacConkey, SS, HE, and TCBS agars were inoculated in motility indole ornithine agar (MIO), Kliger iron agar (KIA), lysine iron agar (LIA), and Simmons citrate (CIT) and incubated at 37 °C for 24 h. After incubation, enteropathogen differentiation was performed according to their growth characteristics in the different agars and differential biochemical reactions⁷⁸. Colonies growing on the CBF agar were evaluated with oxidase, catalase, and gram stain testing to differentiate Campylobacter colonies. Colonies with morphology and biochemical profile consistent with Shigella were confirmed by agglutination with serotype-specific antisera.

After biochemical identification, all the lactose fermenting and enteropathogens colonies identified were shipped to NAMRU-6 in Lima for quality control and further workup. For transportation, lactose fermenting colonies were inoculated again in MacConkey agar and Campylobacter colonies were inoculated in CBF agar, all were incubated at 37 °C for 24 h, and isolated colonies resuspended in cryovials with trypticase soy broth with 15{32bc5e747b31d501df756e0d52c4fc33c2ecc33869222042bcd2be76582ed298} glycerol, frozen at −70 °C, and shipped overnight on dry ice. The colonies identified as Salmonella spp., Shigella, Aeromonas spp., Plesiomonas spp., and Vibrio spp. were inoculated in vials containing trypticase soy agar, incubated at 37 °C for 24 h, and shipped overnight at 4–8 °C.

All identified enteropathogen colonies were inoculated on Mueller-Hinton (MH) agar to perform antibiotic susceptibility testing using the Kirby Bauer disk diffusion method according to the Clinical and Laboratory Standards Institute (CLSI) guidelines⁷⁹. Antibiotic susceptibility testing was performed with 17 antibiotic discs (amoxicillin-clavulanate, ampicillin, azithromycin, ceftriaxone, chloramphenicol, cephalothin, cefepime, ciprofloxacin, rifaximin, amikacin, gentamicin, imipenem, tetracycline, trimethoprim-sulfamethoxazole, ticarcillin, ticarcillin/clavulanate, and furazolidone). Colonies were suspended in sterile saline and turbidity adjusted to the 0.5 McFarland standard before plating onto MH agar using the lawn streak technique. Antibiotic discs were added with proper spacing between and plates were incubated at 37 °C for 18–20 h after inoculation. Zones of inhibition were measured and recorded as the diameter of growth inhibition, with interpretation of susceptibility or resistance determined using CLSI M100 published breakpoints. It is important to note that currently there is no established CLSI breakpoint for determining the azithromycin resistance in E. coli. However, azithromycin has been found to be highly effective and is frequently used in treating diarrheal infections caused by Gram-negative pathogens, including diarrheagenic E. coli. In this study, we defined azithromycin resistance as per⁸⁰, where the minimum inhibitory concentration of azithromycin was determined in accordance with CLSI guidelines using the agar dilution method on all isolates with halo diameter <15 mm.

PCR detection of pathogenic E. coli

Five lactose-positive colonies were selected randomly from each participant stool culture plate and shipped frozen at −70 °C to NAMRU-6 in Lima for quality control and real-time florescence-based multiplex PCR for the detection of the currently recognized classes of diarrheagenic E. coli as previously described^81,82.

Frozen stocks of isolated colonies were restreaked onto MacConkey agar and incubated at 37 °C for 18–20 h. For DNA extraction, a single lactose-positive colony was carefully removed from the MacConkey agar plate using a sterile toothpick to avoid agar contamination and placed individually into a vial with 100 μl of sterile molecular-grade water. DNA was extracted by boiling at 100 °C for 5 min, then 5 min on ice, followed by centrifugation at 14,000 rpm for 10 min at 4 °C.

Primers were designed as previously described⁸² to detect eight different virulence genes simultaneously in a single reaction, including aggR for enteroaggregative E. coli (EAEC), ST1a/ST1b, and LT for enterotoxigenic E. coli (ETEC), eaeA for enteropathogenic E. coli (EPEC), stx1 and stx2 for Shiga toxin-producing E. coli (STEC), ipaH for enteroinvasive E. coli (EIEC), and daaD for diffusely adherent E. coli (DAEC)⁸². The primers were designed so that amplicons had differential melting temperatures (TmS) ranging from 77 to 95 °C. The primers were diluted with TE buffer pH: 8.0 to obtain a 100 μM stock concentration, then further diluted to 25 μM working concentration with molecular grade water.

Two microliters of the DNA extraction lysate was used as a template added with 23 μl PCR master mix for a final reaction volume of 25 μl including Phusion High Fidelity buffer (Finnzyme OY, Espoo, Finland) and 0.5 U Phusion polymerase with 200 μM deoxynucleoside triphosphates and 4 mM MgCl₂ (ThermoScientific, Waltham, MA). Sybr green I was added as recommended by the manufacturer (Cambrex Bio Science, Rockland, ME). The qPCR was performed on Rotor-Gene Q PCR thermocycler with software version 1.7 (Qiagen, Germantown MD). Cycling conditions were 98 °C for 50 s, 60 °C for 20 s, 72 °C for 30 s, and 75 °C for 1 s over a total of 25 cycles, after which melting curves were determined with a ramp speed of 2.5 °C/s reading each 0.2 °C increments between 73 and 95 °C⁸². Melting peaks were calculated by the software to allow the interpretation of reactivity for the different virulence factor genes. Reference strains for each pathogenic E. coli were included as positive controls in the reaction.

RT-PCR for norovirus detection

Stool specimens were transported in portable coolers on ice packs to the field laboratory, aliquoted, and were stored at –80 °C until they were shipped in batches on dry ice to the NAMRU-6 laboratory in Lima for testing. Specimens were treated as described previously⁸³ with some modifications. In brief, fecal samples were diluted 10{32bc5e747b31d501df756e0d52c4fc33c2ecc33869222042bcd2be76582ed298} in 1× phosphate-buffered saline, vortexed, and centrifuged at 5943 × g for 10 min at 4 °C⁸⁴. Total RNA was extracted from stool-diluted samples using the QIAmp Viral RNA mini kit (Qiagen, Valencia, California) as described by the manufacturer. RNA was eluted with 60 µL of 0.01{32bc5e747b31d501df756e0d52c4fc33c2ecc33869222042bcd2be76582ed298} of RNAse inhibitor (Qiagen) in DEPC Treated Water (Invitrogen, Life Technologies, Waltham, MA). RNA samples were stored at −80 °C until use. Norovirus detection was performed using a Duplex genogroup-specific real-time reverse-transcription polymerase chain reaction (qRT-PCR) developed and described by the National Calicivirus Laboratory at the Centers for Disease Control and Prevention in a 7500 FAST real-time platform (Applied Biosystems, Waltham, MA)^85,86. RT-PCR was performed at 45 °C for 10 min and 95 °C for 10 min followed by 45 cycles of 15 seconds at 95 °C and 1 min at 60 °C using primers and probes from previous assays^87,88. A sample was considered positive for norovirus GI when the calculated cycle threshold (Ct) was less than or equal to 37 cycles and considered positive for GII when the Ct was less than or equal to 39 cycles. Test results were only considered valid when all quality control positive reactions were below Ct cutoff and did not exhibit fluorescence above the threshold for the negative template control reactions.

Metagenomic DNA extraction and sequencing

Metagenomic DNA (mgDNA) was extracted from 300–400 mg stool samples using the phenol-chloroform repeated bead-beating protocol as described previously²¹. The extracted mgDNA was diluted to 0.5 ng/μL and sequencing libraries were prepared using a modified Nextera protocol⁸⁹. The libraries were purified using the Agencourt AMPure XP system (Beckman Coulter) and quantified using the Quant-iT PicoGreen dsDNA assay (Invitrogen). For each sequencing lane, 10 nM of approximately 96 samples were pooled three independent times. These pools were quantified using the Qubit® dsDNA BR Assay and combined in an equimolar fashion. Samples were submitted for 2x150bp paired-end sequencing on an Illumina NextSeq-High Output platform at The Edison Family Center for Genome Sciences & Systems Biology at Washington University in St. Louis.

Metagenome profiling

Metagenomics raw reads were processed to remove Illumina adapter/index sequences and low-quality reads using Trimmomatic (v0.36)90 with the following parameters ILLUMINACLIP:NexteraPE-PE.fa:2:30:10:1:true; SLIDINGWINDOW:4:15; LEADING:10 TRAILING:10; MINLEN:60. The removal of human reads contaminants was performed using deconseq (v0.4.3)91. Post-cleaning, 710 samples with >3 million reads were used in downstream analyses. The taxonomic composition was determined using Metaphlan2¹⁷. Taxa with <0.01{32bc5e747b31d501df756e0d52c4fc33c2ecc33869222042bcd2be76582ed298} relative abundance in >90{32bc5e747b31d501df756e0d52c4fc33c2ecc33869222042bcd2be76582ed298} of samples were removed, resulting in 143 species included in downstream comparative analyses.

Functional metagenomics

We constructed 21 small-insert (2–5 kb) functional metagenomic libraries, which were screened against 17 antibiotics (Supplementary Data 1) based on our previously published protocols^{19,20,21,22,23}. The experimental protocol is described below:

Library preparation

The mgDNA extracted from 10 randomly selected stool samples were pooled together for the construction of each individual functional metagenomic library. The pooled mgDNA was sheared with Covaris E220 sonicator following the manufacturer’s recommended settings for a target size of 3 kb (DNA mass: 2–20 µg of mgDNA diluted in 200ul of Buffer EB, intensity: 0.1, duty cycle: 20{32bc5e747b31d501df756e0d52c4fc33c2ecc33869222042bcd2be76582ed298}, cycles per burst: 1000, treatment time: 600 sec). The sheared mgDNA was size-selected by gel-electrophoresis (1{32bc5e747b31d501df756e0d52c4fc33c2ecc33869222042bcd2be76582ed298} agarose, 0.5× Tris-borate-EDTA, GelGreen dye (Biotium), 70 V for 130 min). The band covering 2–5 kb fragment size was excised, then mgDNA fragments were recovered using the QIAquick gel extraction kit (Qiagen). The size-selected mgDNA fragments were end-repaired with the END-It^TM DNA End Repair kit (Epicenter) and purified with the QIAquick PCR purification kit (Qiagen). The end-repaired mgDNA was quantified with the Qubit HS fluorometer assay kit (Invitrogen) and concentrated using a vacuum concentrator (SpeedVac®) to a final volume of ~10 µl. The mgDNA fragments were then ligated into the pZE21-MCS-1 vector at the BamHI site. Linearization of the pZE21 vector was performed by inverse PCR using the following reaction conditions: (1) Mix 10 µl of 10× reaction buffer, 1.5 µl of 10 mM dNTP mix, 1 µl of MgSO₄, 1 µl of 100 pg/µl circular pZE21 plasmid vector, 0.75 µl forward primers (5’ GACGGTATCGATAAGCTTGAT 3’), 0.75 µl reverse primers (5’ GACCTCGAGGGGGGG 3’), 0.4 µl blunt-end HF DNA polymerase and 29.6 µl of nuclease-free water to a final volume of 50 µl. (2) PCR settings: 95 °C for 5 min, then 35 cycles of [95 °C for 45 sec, 55 °C for 45 sec, 72 °C for 2.5 min], then 72 °C for 5 min. The linearized pZE21 plasmid vector was size-selected (~2200 bp) using gel-electrophoresis, purified with the QIAquick PCR purification kit (Qiagen), and then dephosphorylated with calf-intestinal alkaline phosphatase (CIP, New England BioLabs) by adding 40 µl of gel-purified DNA, 5 µl of CIP (10 U/µl), and 5 µl of the 10× reaction buffer. The 50 µl reaction was incubated overnight at 37 °C before the CIP was heat-inactivated by incubating the reaction mix at 70 °C for 15 min. The ligated plasmid DNA was then purified using the QIAquick PCR purification kit (Qiagen) dialyzed with cellulose membrane (Millipore, VSWP09025) for 30 min, and then electroporated into E. coli MegaX DH10B (Invitrogen) following the manufacturer’s instructions. After transformation, cells were recovered in 1 ml of recovery medium (Invitrogen) for 1 h at 37 °C. The library titers were determined by plating 0.1 µl and 0.01 µl of recovered cells onto Luria-Broth (LB) agar plates containing 50 µg/ml kanamycin as previously described²⁰. The remainder of recovered cells were grown overnight in 50 ml of LB broth containing 50 µg/ml kanamycin (LB-Kan). The culture was then centrifuged and resuspended in 15 ml of LB-Kan broth containing 15{32bc5e747b31d501df756e0d52c4fc33c2ecc33869222042bcd2be76582ed298} glycerol and stored at −80 °C for subsequent screening. Functional screening of antibiotic resistance: Each metagenomic expression library was screened on Mueller-Hinton agar with 50 µg/ml kanamycin and one of the 17 antibiotics at concentration listed in Supplementary Data 1. Before plating each library on antibiotic-containing growth media, the concentration of each library was adjusted such that 100 µl of library freezer stock contained at least 10× the total number of unique clones as determined at the time of library creation. To adjust the concentration, the freezer stock solution was either diluted with MH-Kan or centrifuged and reconstituted again in the appropriate volume for plating. The antibiotic selection plates were incubated for 24 h at 37 °C to allow growth of antibiotic-resistant clones. Additionally, for each antibiotic selection, a negative control plate of E. coli MegaX DH10B strain transformed with unmodified pZE21 (i.e., without a metagenomic insert) was plated to ensure that the concentration of antibiotic used entirely inhibited the growth of clones containing unmodified pZE21. The surviving colonies from each antibiotic selection were collected by adding 750 µl of LB-Kan with 15{32bc5e747b31d501df756e0d52c4fc33c2ecc33869222042bcd2be76582ed298} glycerol, and colonies were scraped with an L-shaped spreader. The slurry of antibiotic-resistant clones removed from the surface of the plate was stored at −80 °C before sequencing them with Illumina NextSeq platform.

Sequencing, assembly, and annotation of functionally-selected resistance determinants

The plasmid DNA containing antibiotic-resistant mgDNA fragments was extracted from functionally-selected clones using the QIAprep Spin Miniprep kit (Qiagen), and prepared for sequencing with a Nextera protocol, as described above. The samples were submitted for sequencing using an Illumina NextSeq platform (2 × 151 bp paired-end reads). Reads from each antibiotic selection were assembled into contigs using PARFuMS¹⁹, a tool specifically designed for high-throughput assembly of resistance-conferring DNA fragments from functional selections. Selections were excluded from analysis if: (1) the number of contigs assembled was 10 times more than the total number of colonies; or (2) >200 contigs were assembled. Contigs were also filtered based on length (>500 bp). A total of 7020 contigs were obtained, and 16,334 open reading frames (ORF) were predicted in these contigs using the gene-finding algorithm Prodigal⁹⁰. These ORFs were then annotated using an in-house pipeline called resAnnotator.py, which follows a hierarchical approach: (1) ORFs are searched against BLAST-based ARG databases (CARD⁴⁶, ResFinder⁴⁶, and AMRFinder-Prot⁴⁷) with high percent-identity (>95{32bc5e747b31d501df756e0d52c4fc33c2ecc33869222042bcd2be76582ed298}) and coverage (>95{32bc5e747b31d501df756e0d52c4fc33c2ecc33869222042bcd2be76582ed298}); and (2) the remaining ORFs are annotated using HMM-based ARG databases (Resfams⁹¹, AMRFinder-fam⁴⁷). Overall, 1233 unique ARGs were identified.

Resistome profiling of metagenomics samples

The relative abundance of ARGs in the metagenomics samples was calculated using ShortBRED¹⁸(v0.9.4). First, we built a high-precision ARG-specific markers database from 7,921 antibiotic resistance proteins that were used as proteins of interest for identification of marker families using ‘shortbred_identify.py’ with the following non-default parameters: –-clustid 0.95 –-ref Uniref90⁹². The antibiotic resistance protein sequences include sequences from the CARD database⁴⁶, the NCBI-AMR database⁴⁷, and antibiotic resistance proteins identified using functional metagenomics in this cohort, and previous studies^{19,21,22,23,93,94,95,96,97}. In total, we generated a comprehensive set of markers database consisting of 6,585 unique marker sequences representing 2,331 AMR gene families. These AMR gene families were then manually curated, and entries with the following criteria were removed from analysis consideration because they would not be confidently expected to provide resistance based solely on a short-read marker (e.g., when that gene would require other components to provide phenotypic resistance, or when short-read markers would not distinguish between susceptible vs. resistant versions of an antibiotic target):

(1)

Genes associated with global gene regulators, two-component system proteins, and signaling mediators (e.g., blaZ, vanS-vanR, mecI, mepR, gadW, marR);
(2)

Genes encoding subunits that are part of multiple efflux pumps (e.g., tolC, oprM, opmD);
(3)

Resistance via mutation in genes (e.g., resistance to antifolate drugs via mutations in dhfr, resistance to rifamycin via mutation in rpoB);
(4)

Genes conferring resistance by modifying cell wall charge (e.g., mprF);
(5)

Genes that reduce permeability (e.g., omp38, tmrB) or confer resistance through overexpression (e.g., Thymidylate synthase); and
(6)

General efflux pumps that came through functional selections (e.g., MFS-type, ABC-type)

The relative abundance of AMR gene families was quantified by mapping reads to the filtered set of marker sequences using ‘shortbred_quantify.py’. ShortBRED hits were filtered-out if they had counts lower than 2, or a mean reads per kilobase million (RPKM) lower than 0.001.

Bacterial DNA isolation, sequencing, and assembly

The E. coli isolates stored at −80 °C in 15{32bc5e747b31d501df756e0d52c4fc33c2ecc33869222042bcd2be76582ed298} glycerol were inoculated in 1.5 ml tryptic soy broth and were grown overnight at 37 °C with shaking. Genomic DNA was extracted using the Biostic Bactermia DNA isolation kit as per the manufacturer’s protocol. DNA concentration was quantified using the Qubit fluorometer and stored at −20 °C. Isolate sequencing libraries were prepared using the Nextera protocol⁸⁹, and were sequenced on the Illumina NextSeq platform with 2 × 150 paired-end reads. Raw sequencing reads were binned by index sequences, quality filtered using Trimmomatic(v0.36)⁹⁸ and human contaminants were removed by DeconSeq (v0.4.3)⁹⁹. Processed reads were assembled into draft genomes using SPAdes (v3.11.0)¹⁰⁰ with the following parameters: spades.py -k 21,33,55,77 –careful. Assembly quality was assessed using CheckM¹⁰¹ and the following inclusion criteria were used for downstream analysis: >90{32bc5e747b31d501df756e0d52c4fc33c2ecc33869222042bcd2be76582ed298} completeness, <5{32bc5e747b31d501df756e0d52c4fc33c2ecc33869222042bcd2be76582ed298} contamination, and <500 contigs longer than 500 bp.

Genomic analysis of E. coli isolates

The assemblies of 189 E.coli isolates and 40 publicly available E.coli reference genomes were annotated using Prokka (v1.12)¹⁰² with default parameters. Multilocus sequence types and serotype were determined using MLST (https://github.com/tseemann/mlst) and SerotypeFinder¹⁰³.

In silico detection of antibiotic resistance and virulence genes in the isolate genomes was performed using the in-house pipeline resAnnotator.py, which sequentially annotates ORFs by searching against the given Blast- and HMM-based databases. For antibiotic resistance genes, we screened against the following databases in order: Resfinder¹⁰⁴, NCBI-AMR⁴⁷, CARD⁴⁶, and Resfams⁹¹. For virulence gene identification, VirulenceFinder^105,106 and VFDB¹⁰⁷ were used. Mobile genetic elements in the isolate genomes were predicted by MGEFinder⁵⁴. Average nucleotide identities among isolates were computed using pyani (https://github.com/widdowquinn/pyani). The pangenome analysis was performed using Roary⁴⁹, with core-genome alignments created from assembled contigs of all E. coli isolates (189 sequenced and 40 published reference genomes). The reference genome of E. marmotae was also included as an outgroup. The maximum-likelihood core-genome phylogenetic tree was constructed using RAXML (v8.2.11)⁵⁰ with the following parameters: -m GTRGAMMA -f a -N 1000 -x 25418 and visualized using iTOL¹⁰⁸.

Statistical analysis

Identification of covariates

To identify the metadata variables that significantly affected the taxonomic profile, functional profile, and resistome profile, we calculated the total variance contribution of each variable using PERMANOVA with repeated measures, as described previously²⁵. To account for repeated measurement in metadata variables that change over time (e.g., stool consistency, sample type, pathogen presence), permutations were limited to within the subjects. Meanwhile, variables that were constant across samples from the same subject (e.g., age, sex, travel history, lodging, country of origin) were first permuted across subjects, with samples re-labeled with the variable from their permuted subject. The significance value for metadata variables was determined using 5,000 permutations and p values that were FDR corrected for multiple-hypothesis testing. The metadata variables that showed significant variation among these profiles were included in the models as covariates.

Longitudinal changes in alpha- and beta-diversity

To model the effect of length of travel and diarrhea on gut microbial and AR gene diversity, a linear mixed effect model (LMM) (for modeling continuous response variables, e.g., Shannon index) or generalized linear mixed effect model (GLMM) with Poisson family (for modeling count data e.g. richness) was fit by maximum-likelihood using lmer or glmer function in R, respectively. These models include regression on alpha diversity, beta-diversity, and cumulative change in abundance measures against diarrhea and time while also adjusting for the effects of age, sex, region, and inter-individual variability by specifying subjects as the random effects. Pseudo-R2 was determined using r.squaredGLMM function in the MuMin package. Post hoc pairwise comparison between different Sample Type (e.g., DiarrheaTD vs Non-diarrhea PostTD) was performed on fitted models using emmeans package and P values were corrected for multiple hypotheses using Benjamini–Hochberg method (FDR).

Microbial features association analysis

To identify microbial features (taxa or ARGs) associated with diarrhea, we performed multivariable association analysis using MaAsLin2²⁷ that accounts for multiple covariates and repeated measures. Briefly, Masslin2, first applies an appropriate transformation/normalization method: arcsine square-root transformation for taxonomic profile and log transformation with half the minimum relative abundance as a pseudo count. Then, the transformed abundances of each feature were fit with the following linear mixed-effects model:

$$microbial,feature,abundances,(taxa,or,AR,genes), sim ,Sample_type \ left.+Age+Sex+Region+AbxUse+Time,in,Peru(days)+1Subjectsright)$$

where each microbial feature was modeled as a function of diarrhea while adjusting for the effect of metadata variables (significant covariates that came from PERMANOVA analysis) as well as accounting for inter-individual variability by specifying the subjects as random effects. Further, p values of each feature were corrected for multiple-hypothesis testing using Benjamini–Hochberg method with FDR < 0.25.

Microbiome network analysis

We inferred two unsupervised co-occurrence networks from diarrhea (case) and non-diarrhea (control) samples at the species-level using the SparCC⁴⁴ algorithm which calculates the correlations between the microbial species while taking into account the sparsity and inherent compositionality in the microbe relative abundance data. To account for differences between the number of diarrheal and non-diarrheal samples, we took subset of samples to construct co-occurrence networks. For the diarrhea network, only the first diarrhea samples per individual were considered, whereas non-diarrheal samples that were collected within the first two weeks of stay and were not flanked by a diarrheal episode were included in the non-diarrhea network. For the given diarrhea and non-diarrhea datasets, SparCC was used with default parameters to calculate the correlations among species, and then the “MakeBootStrap” command was applied to generate 100 bootstrap tables, which were again used to calculate the SparCC correlations. Finally, the bootstrapped correlations were used with the “PseudoPvals.py” command to generate two-tailed p values from the true table. The correlation values with p values <0.05 were retained. We then compared the two co-occurrence networks using NetShift⁴⁵ which quantifies the changes in the interactions of the individual node to identify the “driver taxa”. The key parameters for identifying the driver taxa were the NESH score, Jacard index and delta betweenness centrality. The co-occurrence networks were drawn using the igraph package in R.

Microbiome shift events

Microbiome “shift”²⁵ events were defined as when the Bray–Curtis dissimilarity index between two consecutive samples from a single individual was more likely to have come from the distribution of the Bray–Curtis dissimilarity index derived from samples of different individuals. Based on this definition, we first obtained the distribution of the Bray–Curtis dissimilarity index between samples from different individuals of the HT cohort and the Bray–Curtis dissimilarity index between samples from the same individuals of the HT cohort. The point at which the inter-individual dissimilarity estimate exceeded the intra-individual dissimilarity estimate was chosen as the threshold to define the “microbiome shift” event (Bray–Curtis dissimilarity: 0.52).

Machine learning classification model

To distinguish diarrheal from non-diarrhea samples based on taxonomic composition, we built machine learning classification models using SIAMCAT⁴³, an ML-based tool box developed specifically for metagenomics studies¹⁰⁹. The SIAMCAT⁴³ workflow consists of normalization of taxonomic features, splitting the datasets using cross-validation schemes (fivefolds), training with different machine learning models (LASSO logistic regression, Random Forest, and Elastic Net), and evaluating the performance of the models using area under the receiver operating characteristic curve.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.