Sunday, 12 May 2013

Identifying Wild Yeast Part Deux

From XKCD.com
So my first go at identifying wild yeast using ITS sequencing did not work as well as I had hoped. Of course, after setting up the system I learned that the method I was using was best for non-yeast fungi, rather than for yeast. Luckily one of my few readers, Sam - author of eurekabrewing - came to the rescue with a few papers showing better ways to do it.

So I've revamped the plan and will use a slight variation of my old method. Previously I was trying to sequence the regions of the genome between pieces of the ribosome (full explanation of the method and what a ribosome is can be found here). Instead, I'm going to sequence part of the 26S portion of the ribosome (26S rRNA). The specific part I'm amplifying has been used successfully to identify yeast in the past, including the very species of yeast we're likely to find in a wild brew [1-3].

As always, the meat is below the fold...




The Method

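So the method is easy - it's just a minor adjustment to the old method. The only thing that is different is the primers used to amplify the DNA - instead of using the ITS1 & ITS4 primers, we use the NS1 and NS4 primers (note: the two sequences listed here are the ones usually published under the names NL1 and NL4, the standard primer pair for the D1/D2 region of the 26S rRNA gene):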

  • NS 1: 5' - GCATATCAATAAGCGGAGGAAAAG - 3'
  • NS 4: 5' - GGTCCGTGTTTCAAGACGG - 3'
The resulting PCR should amplify ~570bp of DNA - previous studies have shown that different species of yeast should have between 6 and 150 mutations relative to good ol' Saccharomyces cerevisiae. Easy as pie - I hope...
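For anyone who wants to sanity-check a primer pair before ordering it, here is a minimal Python sketch that reports length, GC content, a rough melting temperature and the reverse complement of each primer. The primer sequences and labels are copied from the list above; the function names and the use of the simple Wallace rule (2 C per A/T, 4 C per G/C) are my own assumptions for illustration, and that rule is only a crude guide for oligos of this length - it is not how the 50C annealing temperature was chosen.

```python
# Primer sequences copied from the post (written 5' -> 3'), keyed by the
# labels used in the post.
PRIMERS = {
    "NS1": "GCATATCAATAAGCGGAGGAAAAG",
    "NS4": "GGTCCGTGTTTCAAGACGG",
}

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq):
    """Rough melting temperature in C using the Wallace rule.

    ASSUMPTION: a rule of thumb chosen for illustration only.
    """
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def primer_report(primers=PRIMERS):
    """Summarise each primer as a small dictionary of properties."""
    return {
        name: {
            "length": len(seq),
            "gc_percent": round(100 * gc_fraction(seq), 1),
            "wallace_tm_c": wallace_tm(seq),
            "reverse_complement": reverse_complement(seq),
        }
        for name, seq in primers.items()
    }


if __name__ == "__main__":
    for name, info in primer_report().items():
        print(name, info)
```

The reverse complement is included because the second primer binds the opposite strand, so it is its reverse complement that appears in a sequence read from the first primer.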

Here's the play-by-play:
  1. Set up a 30ul PCR using 2ul of the DNA prepared last time, Pfu polymerase, and the NS1 & NS4 primers. PCR conditions are 40 cycles of:
    1. 96C/30sec melt
    2. 50C/30sec annealing
    3. 72C/45sec extension
  2. Run PCR products on a 2% agarose gel & purify the DNA
  3. Sequence using the NS1 and NS4 primers*
* For the first test I am maximizing my chance of success by sequencing the whole PCR product from both ends, using both primers. However, reference 2 shows that most of the genetic differences should be near the 3' end of the sequence, so future experiments may sequence using only NS1.
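The cycling conditions above are easy to write down as data, which makes it simple to work out how long the thermocycler will be tied up. This is only a sketch: the class and function names are mine, and the total ignores ramp times plus any initial denaturation or final extension step (none are listed in the protocol above).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    temp_c: int
    seconds: int


# One cycle, as listed in the play-by-play above.
CYCLE = (
    Step("melt", 96, 30),
    Step("annealing", 50, 30),
    Step("extension", 72, 45),
)
N_CYCLES = 40


def cycling_seconds(cycle=CYCLE, n_cycles=N_CYCLES):
    """Total hold time across all cycles, ignoring ramping between steps."""
    return n_cycles * sum(step.seconds for step in cycle)


def required_extension_rate(product_bp, extension_seconds):
    """Minimum polymerase speed (bp per minute) needed to finish the product."""
    return product_bp / (extension_seconds / 60)


if __name__ == "__main__":
    total = cycling_seconds()
    print(f"Cycling time: {total} s ({total / 60:.0f} min)")
    rate = required_extension_rate(570, 45)
    print(f"A ~570bp product in 45 s needs at least {rate:.0f} bp/min")
```

With the numbers from the protocol this works out to 4200 seconds (70 minutes) of hold time, and the 45 second extension requires the polymerase to manage at least 760 bp per minute.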


The Main Results

Once again, Brett PCR'd nicely;
the other yeasts, not-so-much
As last time, only the Brett sequences amplified. But they amplified very well, producing products ~600bp in size. This can be seen on the gel to the left - the B. bruxellensis & lambicus produced strong bands at the expected size, while the Irish ale (1084) and the mystery yeast from my club president's ruined batch (Pres) did not amplify.

This lack of amplification could be due to a number of problems; I can think of two. Firstly, the 1084 & Pres strains are/may be commercial yeast strains. As these strains were selected for by brewers, they may have lost a lot of the characteristics which would allow them to survive in the wild. As such, they may not be as tough as the more-wild Bretts, leading to their DNA being destroyed by the fairly rough handling used to purify it. Secondly, there was a lot more yeast used in the Pres & 1084 isolations - perhaps I saturated the system and the DNA ended up stuck in the protein/chloroform layers of the purification.

Either way, I'm going to try an isolation "free" method next time, which is described later in this post - in the section bearing the astoundingly astute title 'Next Time'...

As for the two Brett sequences which worked, they were packed up and sent for sequencing. I messed up my preparation, so only the lambicus got sequenced from both ends. The resulting sequences were BLASTed against NCBI's full nucleotide collection, as well as aligned relative to the canonical Saccharomyces cerevisiae [4-5] and Brettanomyces bruxellensis [6] 26S rRNA sequences.

Sample: B. bruxellensis (WLP650)
Sequence:
AGTAGCGGCGAGTGNNGCGGCAAAAGCTCAAATTTGAAATCTGGCGCCTTCGG
TGTCCGAGTTGTAATTTGAAGATTGTAACCTTGGGGTTGGCTCTTGTCTATGT
TTCTTGGAACAGGACGTCACAGAGGGTGAGAATCCCGTGCGATGAGATGCCCA
ATTCTATGTAAGGTGCTTTCGAAGAGTCGAGTTGTTTGGGAATGCAGCTCTAA
GTGGGTGGTAAATTCCATCTAAAGCTAAATATTGGCGAGAGACCGATAGCGAA
CAAGTACAGTGATGGAAAGATGAAAAGAACTTTGAAAAGAGAGTGAAAAAGTA
CGTGAAATTGTTGAAAGGGAAGGGTTTGAGATCAGACTCGATATTTTGTGAGC
CTTGCCTTCGTGGCGGGGTGACCCGCAGCTTATCGGGCCAGCATCGGTTTGGG
CGGTAGGATAATGGCGTAGGAATGTGACTTTACTTCGGTGAAGTGTTATAGCC
TGCGTTGATGCTGCCTGCCTAGACCGAGGACTGCGATTTTATCAAGGATGCTG
GCATAATGATCCCAAACCGCC
ID via BLAST: Hundreds of equal score
% Align: 99% or more across >95% of the sequence

Sample: B. lambicus (WLP653)
Sequence:
CGCCTANATACCTCGNTNTGCTAATCCTGTTACNNGNNGCTGCTGCCAGNGNN
GATAAGTCGTGTCTTNCNGGGTTGGACTCAAGACGATAGTNNCCGGATAAGGC
GCAGNGTCGGGCTGNNCGGGGGNTTCGNGCACNCAGCCCNNCTTGGAGCGAAC
GACCTACNCCGANNTGAGANNACNTNCAGCGTGAGCTATGAGAAAGCGCCACG
CTNCCCGAAGGGNNAANGGCGGACAGGTATCCGGTNAAGCGGCAGGGTCGGAA
CAGGAGAGCGNNCGAGGGAGCTNCCAGGGGGAAACGCCNGGTATCTTNATAGT
CCTGTCGGGTTTCGCCACNTNNGACTTGAGCGTCGAANNNTTTGCATNTTNNN
TANNNGANGAAAAGAAACCAACAGGNATTGCCCCAGTAGCGGNGAGTGAAGCG
GCAAGAGCTCAGATTTGAAATCGTGNTAATTTTTTTGGCACGAGTTGTAGAGT
GTAGGCGGGAGTCTTTGTGGAGCACGGTGTCCAAGTCCCTTGGAACAGGGCGC
CTGAGAGGGTGAGAGCCCCGTGGGGTGCCGTGCGAAGCTTTGAGGCCCTGCTG
ACGAGTCGAGTTGTTTGGGAATGCAGCTCCAAGTGGGTGGTAAATTCCATCTA
AGGCTAAATACTGGCGAGAGACCGATAGCGAACAAGTACTGTGAAGGAAAGAT
GAAAAGCACTTTGAAAAGAGAGTGAAACAGCACGTGAAATTGTTGAAAGGGAA
GGGTATTGGGCCCGACATGGGGAGTGCGCACCGCTGTCTCTTGTAGGCGGCGC
TCTGGGCGCTCTCTGGGCCAGCATCGGTTCTTGCTGCGGGAGAATGGGTGCCG
GAAAGTGGCTCTTCGGAGTGTTATAGCCGGCGCCAGATACCGCGTGCGGGGAC
CGAGGACTGCGGCTC
ID via BLAST: Hundreds of equal score
% Align: 99% or more across >95% of the sequence
Note: 'Sample' indicates the yeast strain sequenced, 'ID via BLAST' is the closest match in the NCBI nucleotide database, '% Align' is the % positive alignment with the closest matches

Again, the extreme density of data in the NCBI database bites me in the ass. Over 1200 strains matched my sequences with >99% identity. The good news is that my two sequences are not 100% identical to each other - not counting the 4 places where a sequence had an unknown ('n'), 388 out of 519 bases match (75% match). That was far more variability than I had expected!
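The "bases which match" figure is just a column-by-column comparison of two aligned sequences that skips any position containing an unknown base. A minimal sketch of that calculation is below; it assumes the two sequences have already been aligned to the same length (with '-' for gaps), and the function name is hypothetical rather than anything from the tools I actually used.

```python
def percent_identity(seq_a, seq_b, unknown="N"):
    """Compare two pre-aligned sequences position by position.

    ASSUMPTION: seq_a and seq_b are already aligned and the same length.
    Positions where either sequence has an unknown base are skipped; a gap
    ('-') opposite a base counts as a mismatch.
    Returns (matches, compared, percent).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = 0
    compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if unknown in (a, b):
            continue  # uncalled base: not counted either way
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return matches, compared, 100.0 * matches / compared


if __name__ == "__main__":
    # Toy example, not real data: 4 comparable positions, 3 of which match.
    print(percent_identity("ACGTN", "ACCTA"))
    # The numbers quoted above: 388 matches over 519 comparable bases.
    print(f"{100 * 388 / 519:.1f}%")
```

For the numbers quoted above, 388/519 comes out at 74.8%, which is where the rounded 75% comes from.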

So what happens if we compare these only to the matching genera - i.e. Brettanomyces or Dekkera? Our brux strain matches, 100% on ~90% of sequenced bases, with multiple Brettanomyces naardenensis strains. If we further limit our search to just bruxellensis strains we also get 100% matches, but only on short gene segments. Our lambicus strain aligns most closely with Dekkera anomala. As with the bruxellensis-restricted search, all of the anomala sequences in NCBI are partials, so we only get a partial match with our sequences.

I think that latter issue is going to limit the usefulness of this method. Many of the sequences in the NCBI database are partial sequences - i.e. fragments of genes instead of entire genes/genomes - so matches end up being biased towards database sequences which match modestly well across the whole of the sequence I enter, instead of towards perfect matches between my larger sequences and the shorter fragments typical of the bruxellensis and anomala entries. In other words, the system is biased towards longer segments with a few mismatches over shorter sequences with perfect matches.
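To see why a long, imperfect hit can outrank a short, perfect one, it helps to look at how a raw alignment score adds up. The sketch below uses a match reward of +1 and a mismatch penalty of 2 purely as illustrative values (an assumption on my part, and gaps are ignored entirely); real BLAST rankings are based on bit scores and E-values derived from scores like this, so treat it as a cartoon of the effect rather than a reimplementation.

```python
def raw_score(matches, mismatches, reward=1, penalty=2):
    """Ungapped raw alignment score: reward per match, penalty per mismatch.

    ASSUMPTION: reward/penalty values are illustrative, not taken from
    the settings used for the searches in this post.
    """
    return reward * matches - penalty * mismatches


def max_mismatches_still_winning(long_len, short_len, reward=1, penalty=2):
    """Most mismatches a long alignment can carry and still strictly
    outscore a perfect alignment of short_len (integer reward/penalty)."""
    if long_len <= short_len:
        return -1  # a shorter alignment can never beat the perfect one
    return (reward * (long_len - short_len) - 1) // (reward + penalty)


if __name__ == "__main__":
    long_imperfect = raw_score(matches=495, mismatches=5)  # 500bp, 99% identical
    short_perfect = raw_score(matches=300, mismatches=0)   # 300bp, 100% identical
    print(long_imperfect, short_perfect)
    print(max_mismatches_still_winning(500, 300))
```

Under these toy parameters the 500bp hit with 5 mismatches scores 485 against 300 for the perfect 300bp fragment, and it could carry 66 mismatches before the short perfect match finally overtook it.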

What does this mean for this method? Unfortunately, it means that it's not as useful as I had hoped it would be. Not because the method is flawed, but rather because the coverage quality of the sequences in the NCBI database is variable. That doesn't mean that this method is useless - it does, however, mean that I need to first acquire some more conventional information on the yeasts I'm trying to identify - i.e. reduce the list of possibles to a smaller number of species, and then BLAST the 26S sequences against that list. The 'possibles' list can simply be built from a mixture of morphology and some basic biochemical observations (e.g. bromocresol green metabolism). Likewise, as I acquire my own library of high-quality sequences, I can BLAST new samples against those, providing a beer-orientated "short list" of sequences to compare to.
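The "short list" idea can be sketched as a simple filter over search hits: keep only hits from the genera that morphology and biochemistry allow, drop hits that cover too little of the query, and rank what is left by identity. Everything in the example is hypothetical - the Hit structure, the coverage threshold and especially the hits themselves, which are made up for illustration and are not results from this experiment.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    subject: str             # database entry identifier (hypothetical)
    species: str             # "Genus species" label for that entry
    percent_identity: float
    align_len: int           # number of query bases covered by the alignment


def rank_hits(hits, query_len, shortlist, min_coverage=0.5):
    """Keep hits whose genus is on the shortlist and which cover at least
    min_coverage of the query, then sort by identity and coverage."""
    kept = []
    for hit in hits:
        genus = hit.species.split()[0]
        coverage = hit.align_len / query_len
        if genus in shortlist and coverage >= min_coverage:
            kept.append((hit, coverage))
    kept.sort(key=lambda pair: (pair[0].percent_identity, pair[1]), reverse=True)
    return [hit for hit, _ in kept]


if __name__ == "__main__":
    # Made-up hits for illustration only.
    hits = [
        Hit("hit_A", "Pichia sp.", 99.0, 500),
        Hit("hit_B", "Brettanomyces bruxellensis", 100.0, 120),
        Hit("hit_C", "Dekkera anomala", 98.5, 300),
        Hit("hit_D", "Brettanomyces naardenensis", 100.0, 470),
    ]
    for hit in rank_hits(hits, query_len=519, shortlist={"Brettanomyces", "Dekkera"}):
        print(hit)
```

The coverage threshold is the knob that matters here: set it high and the short-but-perfect database fragments vanish, set it low and they come back, so it has to be chosen with the partial-sequence problem in mind.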

If you've made it this far in this already-long write-up, congratulations! While I've exhausted the basic discussion of the results at this point, I did have a bit of a nerd-gasm and did some deeper analysis. This is discussed below - warning, it's discussed without my usual attempts at explaining a lot of the science lingo, and includes a bit of a rant on an irrelevant topic...


In Which I Totally Nerd Out

The discussion I have been having with Sam (author of Eureka Brewing) in my last yeast sequencing attempt has motivated me to geek-out a little and write a bit more in-depth analysis of the results of this trial. But before I begin I should comment on one 'issue': the classification of species and how this relates to our friend Brettanomyces.

Prior to the age of modern genetics, species were classified based on biochemical, physical and other properties. This form of classification is termed "systematics". With the modern genetic era a new (circa 1963) method of classifying organisms developed, in which the patterns of genetic inheritance are used to classify organisms. This 'new' method is termed "cladistics". As odd as it may sound, the systematists maintain a stranglehold on the naming of new species, and while 99% of the time the systematists & cladists agree, their occasional disagreements tend to lead to vigorous - and often less-than-civil - debate. There are legitimate scientific issues on both sides of the fence - systematists often have issues when closely related species undergo convergent evolution, thus creating the appearance of one species where two exist, while cladists face issues of determining if the genes being used to compare species are truly homologous. As our knowledge of genetics has improved, this limitation of cladistics is fading but, as you will soon see, it remains an issue. Arguments on both sides are further complicated by the fact that the term 'species' is not clearly defined, while "strain" and "subspecies" remain ambiguous terms without accepted definitions.

So that aside, I'm now going to take a cladistic approach to my sequences. Simply because cladistics is the right way to do things - and now you know which side of the phylogeny war I am on...

The genetic relationship between various Brettanomyces species/strains. Note that B. lambicus appears in two locations in the figure, indicating that two sub-strains of B. lambicus exist. One of these clusters with B. bruxellensis, while the other falls between two recognized Brettanomyces species - bruxellensis and anomalus.
From: The Yeasts - A Taxonomic Study
Brettanomyces is (or at least, has the potential to be) one area of systematist/cladist disagreement. Systematists have long held B. lambicus to be a synonym for B. bruxellensis - i.e. two names for the exact same thing. As you can see in the above figure, which uses a simple method to measure relative genetic differences, this is absolutely true for some lambicus strains. At the top of the diagram, represented by the black area, is a clade (group of closely related organisms) including all of the prototypical B. bruxellensis, as well as a strain of B. lambicus. Clearly, this one strain of lambicus is identical to bruxellensis - i.e. they are the same thing.

But, mid-way down the diagram is another strain of B. lambicus, one that is more closely related to B. anomalus (which is widely agreed to be a separate species) than it is to B. bruxellensis. And here we see an example of the failure of systematics - as measured by a (crude) genetic profile, some strains of B. lambicus either form a unique clade (i.e. a group of strains distinct from other groups of related organisms) or are an evolutionary intermediary between B. bruxellensis & anomalus. Based on the above diagram, we cannot determine which of the two possibilities is true, but with today's modern genetics we can do such a comparison.

Below I've included a genetic tree comparing the 26S rRNAs of the two strains I sequenced here (WLP 653 [lambicus] & 650 [bruxellensis]) along with 'default' B. anomalus, bruxellensis and Saccharomyces cerevisiae strains for comparison. The tree itself was built using phylogeny.fr's one-click tree generator, and was restricted to the variable (D1, D2 and D3) regions of the 26S rRNA. I did this restriction as this region was the only one available for anomala. By restricting to this segment we reduce any bias induced by different sequence lengths:

Phylogenetic tree comparing my B. lambicus (WLP653) and B. bruxellensis (WLP650) sequences to "standard" sequences of B. bruxellensis, anomala, and conventional brewers yeast (Saccharomyces cerevisiae).
Unexpectedly, our two brewing strains fall far from the "default" sequences for their supposed species (B. bruxellensis), suggesting that significant evolution has occurred during their pseudo-domestication in a lambic brewery. Amazingly, the B. lambicus strain (WLP653) is in a completely separate clade from the other Bretts, while the B. bruxellensis strain (WLP650) appears to be an outlier when compared to both "default" B. bruxellensis and an accepted separate species, B. anomala.

On the surface this may seem odd - how could strains evolve further away from their parental species (bruxellensis) in a brewery over a few centuries than a completely separate species (anomala) that diverged tens of millions of years ago? There are several possible answers to this, none of which I can confirm or eliminate at this juncture:
  1. Conventional brewing yeasts have undergone huge changes over the past few centuries, including inter-species crosses (which appear to have led to lager yeast), partial duplications of large segments of their genomes, etc. It's possible that these semi-domesticated Brettanomyces underwent a similar case of rapid evolution in the brewery, possibly including hybridization with other Brettanomyces species.
  2. The rRNAs of the brewing strains may have evolved faster than the remainder of the genome, meaning if we compared other regions we may see lesser change. This is a potential issue with using only one gene to establish the evolutionary relationship between species. Indeed, the sequenced region is known to be highly variable, which is why we sequenced it in the first place.
  3. I may have contamination. I doubt this - mixed sequences usually create unsequenceable results - but it's a possibility to keep in mind. The fact I get 100% matches against segments of the species these strains are known to be derived from suggests that contamination is unlikely.
This is a far-cooler result than I had anticipated; but what I was hoping would be a simple method of identification ended up being more complex and involved than anticipated.
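For those wondering how a tree like the one above turns sequence differences into branch lengths, here is a minimal sketch of the first step: converting the fraction of differing sites between two aligned sequences (the p-distance) into an estimated number of substitutions per site. I've used the Jukes-Cantor correction because it is the simplest such model; that choice is my assumption for illustration, and I am not claiming it is the model phylogeny.fr's one-click pipeline applies.

```python
import math


def p_distance(seq_a, seq_b, skip="N-"):
    """Fraction of compared sites that differ between two aligned sequences.

    ASSUMPTION: sequences are pre-aligned and of equal length; sites with an
    unknown base or a gap in either sequence are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in skip or b in skip:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def jukes_cantor(p):
    """Jukes-Cantor distance: d = -3/4 * ln(1 - 4p/3).

    Corrects for multiple substitutions at the same site; only defined
    for 0 <= p < 0.75.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must be in the range [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


if __name__ == "__main__":
    p = p_distance("ACGT", "ACGA")  # toy sequences: 1 of 4 sites differs
    print(p, round(jukes_cantor(p), 4))
```

A p-distance of 0.25 corrects to roughly 0.30 substitutions per site, which shows how the correction stretches the distance between more divergent sequences; tree-building methods then work from a matrix of such pairwise distances (or from the alignment directly).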


Next Time

I'm a believer in tweaking only one thing at a time - this time around I changed the primers I was using. But in my reading I found that, with yeast, the extensive DNA isolation I was doing was unnecessary - it turns out all you need to do is collect a small amount of yeast - literally 2-3ul from a liquid culture, or the amount that sticks to the end of a toothpick touched to a colony on a plate. This is then diluted in 15ul of water and lysed by heating to 96C for 10min. Much easier, and it will be added next time. This may also solve the issue I've had with no PCR products appearing in some samples - assuming this issue is due to the use of an overly harsh DNA isolation protocol.


References

  1. Curtin CD, et al. (2007) Genetic diversity of Dekkera bruxellensis yeasts isolated from Australian wineries. FEMS Yeast Research. Vol 7 #3.
  2. Boekhout T et al. (1994). Phylogeny of the Yeast Genera Hanseniaspora (Anamorph Kloeckera), Dekkera (Anamorph Brettanomyces), and Eeniella as Inferred from Partial 26S Ribosomal DNA Nucleotide Sequences. Int. Jour. System. and Evol. Microbiol. Vol 44 #4.
  3. Guillamon JM et al. (1998) Rapid identification of wine yeast species based on RFLP analysis of the ribosomal internal transcribed spacer (ITS) region. Arch Microbiol. Vol 169.
  4. Saccharomyces cerevisiae strain YJM789 complete ribosomal sequence
  5. Saccharomyces cerevisiae 25S ribosomal RNA gene, complete sequence
  6. Dekkera bruxellensis 26S ribosomal RNA gene, partial sequence

5 comments:

  1. (posted by eurekabrewing.wordpress.com; I can't post with wordpress somehow)
    Bryan, thanks for sharing your results. Interesting results indeed. I had to do some in-silico analysis with your data myself. May I use your two WLP yeast sequences for a future post of mine? Would make it easier to talk about some of my results as well. I have some other questions.

    - Troubleshooting the failed amplification for the Saccharomyces strains. Did you use the same amount of DNA for all the four PCRs? (Measured the DNA concentration).

    - What seq alignment algorithm did you use to compare the sequences (to get the pairwise identities for B. naardenensis)?

    - Have you trimmed your sequences before building the phylogeny tree? I aligned your sequences to a 26S rRNA multiple sequence alignment and trimmed the overhangs in your sequences. Just to get better multiple seq alignments for better/faster phylogeny trees later on.

    - Have a look at one of my trees... (http://www.ebi.ac.uk/Tools/services/web_clustalw2_phylogeny/toolresult.ebi?tool=clustalw2_phylogeny&jobId=clustalw2_phylogeny-I20130513-193853-0176-18004711-pg) What are your thoughts about that?

    - Concerning the electrophoretic shift assay. The 5th edition does not show the table from above anymore. Simply because this method has been criticized since it is based on proteins rather than DNA. Like to cite Kurtzman et al (The Yeasts, a Taxonomic Study, 5th edition): "1. The method assays the genotype only indirectly, so that much variation at the nucleotide level may go undetected because nucleotide substitutions do not necessarily change the amino acid composition; 2. Changes in amino acid composition do not necessarily change the electrophoretic mobility of the protein and, as a consequence, alleles that are considered to be the same protein alleles from different individuals may represent different gene alleles"

    I haven't compared all my phylogeny trees with the mobility table above. I therefore don't want to make any conclusions yet if the (DNA based) phylogeny trees represent the mobility phylogeny relationships.

    Cheers, Sam

  2. Sam, everything here is open to the public; no need to ask permission to use anything (a link-back would be nice though...). I'll try to be brief in my answers.

    The DNA isolation method I used is crude, so I cannot spec the DNA and get an accurate reading. I suspect there is little-to-no DNA in the samples that didn't work. As I say at the end of this post, I'm planning on trying another method that avoids the DNA isolation altogether.

    All my analysis was done using NCBI nucleotide blast, with the mentioned species filters, aside from the cladogram.

    For my tree I trimmed the sequences down to the D1-D3 variable regions. Otherwise you get a massive bias towards the constant region and "false" alignments with things like Pichia.

    As for your tree, it perfectly reflects the issues I was having - if you don't take into account the varying sizes of the rRNA sequences in the databases, you get unusual clustering of the sequencing results driven by the hypersensitive of longer/partial matches to score higher than short/perfect matches. It is possible to 'tweak' the clustering algorithms to more strongly select against mismatches, but I'm not familiar enough with them to do that properly. Instead I limited my trees only to the D1-D3 variable region (including the non-variable bits between the D1/D2 and D2/D3). The resulting region is ~300bp in size.

    You are correct that electrophoretic shift assays are no longer used - but keep in mind that they tend to underestimate evolutionary divergence. Synonymous mutations, and mutations which substitute amino acids with similar chemical properties (i.e. small hydrophobic for small hydrophobic) tend to be missed.

  3. oops. Where I wrote 'driven by the hypersensitive of longer/partial matches to score higher than short/perfect matches.'

    I meant

    "driven by the propensity of the alignment algorithm to score higher for longer/partial matches than short/perfect matches."

  4. I would for sure link to your site. I mean, it is your data and I don't want to get credit for that. In addition, I always mention my sources whenever possible. I just like to ask first before using other's data/results.

    I would like to compare some phylogeny trees and compare them with the shift assay. It seems logical to me that one underestimates divergences on protein level. I am kind of interested now to see if differences are present in phylogeny trees (based on 26S rDNA) and the shift assay.

    If I recall correctly, the sequencing results based on the ITSs gave you some Pichia hits as well as for the 26S rDNA. In Yeasts, chapter 13, Kurtzman et al mention the conundrum of assigning Dekkera to a different species (or one of its own). Interestingly, D. bruxellensis appears in the Pichia clade if one constructs an ascomycetous yeast tree. However, in phylogeny trees with Pichia and other Pichia-close related species (and Dekkera), Dekkera appears as a distant clade. What if Dekkera (or some of the strains) are hybrids of yeast species X and Pichia and therefore show similarities to Pichia sp. 26S rDNA (because the 26S rDNA is from the Pichia species)?

    Would be interesting to know if the two White Labs Brettanomyces strains originate from a yeast database (and therefore would have an identifier like CBS XX).

    I guess there is a very interesting answer why the two White Labs strains show such a divergent 26S rRNA sequence (and ITSs). Cheers, Sam

  5. RE: Pichia. The high rate of "false positives" I get for pichia is (IMO) exactly for the reason you list - brett & pichia are very close, and may even intermix.

    That there is a high degree of relatedness is apparent in my samples - the matches between Pichia (WLP653) and Meyerozyma/Pichia (WLP650) are perfect matches in the constant region of the 26S rRNA. Given the closeness of the species, the perfect match in the constant region is what we'd expect.

    But there are also clearly differences - if you blast the full WLP650 or 653 sequences, despite getting 100% matches with Meyerozyma/Pichia, there's consistently ~300bp of the input sequence that is cut off of the results. This "missing" section is the D1-D3 variable region (basepairs 1-300 in the 653 example, 600-end in the 650). If you BLAST the D1-D3 regions (and nothing else) you get the 'correct' match.

    I believe that the WLP strains are from a brewery, but they may be in a strain library as well.
