Senior Scientist, Dewpoint Therapeutics
Type: Kitchen Table Talk
On September 28, the Dewpoint scientists welcomed brilliant scientist and my long-time friend Alex Holehouse for a Kitchen Table Talk. Alex started his academic career in the UK, where he got biochemistry and computer science degrees from Oxford University and Imperial College, respectively. This unique set of skills primed him for his PhD and postdoc work in Rohit Pappu’s lab, where he moved from systems biology to modeling disordered proteins and emergent properties of condensates. Alex is now an assistant professor in the Department of Biochemistry and Molecular Biophysics at Washington University School of Medicine in St. Louis. He and his lab work to solve the complex questions surrounding how function is encoded in proteins, specifically disordered regions (a recent example can be found here).
In the video, Alex describes how answering these questions has required a suite of novel bioinformatics and molecular simulation tools. He details his motivation behind creating each tool and explains how they can interrogate and connect questions ranging from proteome-wide bioinformatics to complex condensate dynamics. Further, Alex provides a guide for how scientists outside of his lab can use the tools to approach their own research questions. Through his many collaborations—and I feel fortunate to be one of them—Alex has contributed to much of the seminal research that has shaped the biomolecular condensate field. Enjoy the video below and check out the tools section of the Holehouse lab website for more information about the methods he discusses (among others).
Erik Martin (00:00:00):
Hello. It’s my pleasure to introduce Alex Holehouse today. To give a brief background, Alex started his academic career in the UK where he collected a series of master’s degrees–first in biochemistry at Oxford University and second in computer science at Imperial College, which, frankly, I always assumed was to work on whatever the 2010 version of Animal Crossing was. But yes, it actually turned out it was modeling cell system biology.
Erik Martin (00:00:34):
I think from there, if anyone’s paying attention, this gave him the perfect background to move on and do his PhD in Rohit Pappu’s lab, where he moved from systems biology to modeling disordered proteins and emergent properties of disordered proteins in condensates. And we kind of like to joke for a long time that you’d go to a conference in the disordered protein or condensate field and literally every talk would mention either Rohit or Alex at some point in time or another.
Erik Martin (00:01:03):
On a personal note, I would, I guess, like to say that not only are Alex and I wearing nearly the same shirt today, but we’ve known each other for quite a long time. I was going through a series of talks to try and figure out when it was exactly I met Alex. Best I can figure it was winter of 2014. And from there we started a series of very fruitful collaborations, which not only was scientifically interesting but was personally very rewarding for me. So it’s my pleasure to introduce Alex, who’s now an assistant professor of biophysics at Washington University in St. Louis.
Alex Holehouse (00:01:44):
Erik, thanks very much for the kind introduction. Yes, it was winter of 2014, many moons ago. So thank you to Dewpoint firstly for inviting me to speak in this series. I was saying to Jill before, I think having these recorded interactive talks on this topic is really valuable for the whole community regardless of a pandemic or not.
Alex Holehouse (00:02:06):
Today I’m going to give a slightly different talk to the talks I normally give. I’m going to talk more about our tools and our methodologies. And typically when I give a talk, I like to tell a story. I like to explain a biological observation and think about what that means and how we think about that. I’m not going to do that as much today. I’m going to talk instead more about the practical components of how and why we built certain tools, what they do and our goals in doing that…
Alex Holehouse (00:02:30):
But for a brief bit of context, I’m summarizing something that I think we probably all are familiar with, but I’ll just reiterate, which is that disordered proteins are structurally heterogeneous, they’re ubiquitous, and they’re really important for cellular function. And this structural heterogeneity is sort of illustrated by this movie here, of a simulation. The key idea is that even though they lack this fixed three-dimensional structure, which is how we often think of defining disordered proteins, that’s not to say that they are entirely unstructured. There are sequence-specific conformational biases that are encoded by the amino acid chemistry that’s defined along the polypeptide backbone. And those biases give rise to specific long-range local and short-range interactions that give ensembles certain flavors and certain behaviors.
Alex Holehouse (00:03:12):
And so with that in mind, the way we as a lab tend to think about disordered proteins and their function is that function is defined by a combination of that amino acid sequence and the resulting ensemble that the sequence encodes. And so the general paradigm in folded proteins is that we have this sequence-structure-function relationship where the amino acid sequence encodes a three-dimensional structure of the protein and then that protein functions based on that three-dimensional structure itself. In disordered proteins we have something analogous, which is the sequence-ensemble-function relationship. That is, the linear chemistry encoded along the polypeptide backbone determines that conformational ensemble, and that then plays a role in determining the function. But it’s not the only thing, certainly. And so the goal of my group, broadly defined, is to ask, can we decode this relationship? Can we understand how sequences go to function? And we think that thinking about the ensemble is an essential or important component of this linear decoding.
Alex Holehouse (00:04:11):
And so we do this in a few different ways. What I’m going to focus on today is our computational methods development, and we develop tools in the context both of sequence-based things, but also molecular simulations as well. We also do a lot of molecular simulations, whether it’s coarse-grained or all-atom simulations. And we use those simulations in principle as a way to map between sequence and ensemble. I’ll talk a bit about some of our work in that area. And then in my lab we use Saccharomyces cerevisiae as a model system for asking questions of function. We like yeast for a number of reasons. But we collaborate heavily with many groups around the world because we’re interested in these general principles. And I think if we have uncovered general principles, those principles should hopefully be applicable in many different contexts.
Alex Holehouse (00:04:52):
And so broadly, given the goal of my group is to map the sequence-to-function relationship, we really need to understand the biophysical, mechanistic basis of how that sequence can actually encode function. And I’ve introduced this idea of sequence-ensemble-function, the idea that the three-dimensional biases in that ensemble can be and are in many cases important for function, but they’re not the only things. I think one thing that has been heavily studied by many groups is this idea of short linear motifs. These are sequence-specific interactions encoded by a given order of amino acids. So these are not shuffleable. There’s a specific set of residues that engage often, but not strictly necessarily, with a folded binding partner. And these can encode recognition and molecular recognition motifs in a variety of different contexts.
Alex Holehouse (00:05:36):
The other thing, I think, that’s really been brought to the front by work from Rohit Pappu, Tanja Mittag and many others, is this idea of there being these underlying chemical biases in these sequences as well. So maybe there are not specific amino acid sequences that are required, but there are chemical features that might be encoded and might be conserved across evolutionary scales. And so thinking about these three things in conjunction, I think the way that we tend to think about it is that some combination of these things is going to matter for function.
Alex Holehouse (00:06:04):
And depending on the protein and depending on the role, it may not be one specific thing, it may be a combination. These two are intrinsically very closely linked. The types of chemistry we see will influence the ensemble, but there can be examples where you have situations where just a motif is required and the ensemble is maybe less important. And so I think one of the challenges for us in the disordered protein field is that there’s not necessarily a one-size-fits-all relationship between sequence and function. For folded proteins, we have this crutch of folding. We can take advantage of the fact that we sort of know that if it is a folded protein, it probably has to fold to work. Whereas we don’t have that luxury in the context of disordered proteins. And so this is both I think a challenge and an opportunity to think about how we can learn from physical chemistry to understand function in molecular biosciences.
Alex Holehouse (00:06:49):
So the question I want to raise today is how much of this mapping can we do computationally? What we would like to be able to do in a perfect world is from sequence perfectly predict all of the functionally important features. And what that sort of looks like would be an ability to know what those features are going to be a priori. And that’s a difficult thing to know. And I think that’s one of the challenges we face is we can’t necessarily know what things are going to matter upfront.
Alex Holehouse (00:07:12):
So to address this, one of the things my lab has been doing over the last couple of years is building this ecosystem of computational tools to try and help us understand this sequence-ensemble and sequence-function relationship. And so these tools have involved very mundane things like tools for parsing protein FASTA files. This seems like a really boring thing, but actually doing this robustly and then quickly and efficiently is quite important. Tools for doing deep learning for mapping between sequence and arbitrary annotations. Tools for predicting disorder, tools for working with large complex data sets, tools for designing IDRs and tools for predicting sequence features as well.
Alex Holehouse (00:07:44):
Today I’m going to focus on a small subset of these, but I want to make the point that we build these tools, not just because we think building tools is fun, but because we have questions we want to answer. And one of the features I think of being at this interface between computational biology, computer science, and wet lab and experimental biophysics is that we can identify questions that we don’t yet have a way to answer and we can build tools to let us answer those new questions. And that’s something that we try and lean into across the different areas of work in the lab.
Alex Holehouse (00:08:16):
So our goal really is to enable the computational interrogation of IDRs from sequence. So what this looks like at a very practical level is we’d like to be able to take a polypeptide sequence, we’d like to be able to predict where those IDRs are, we’d like to do that quickly and efficiently. And then we’d like to be able to take those disordered regions and make functionally important predictions of features therein. And this again gets back to the challenge that we don’t necessarily know what those functionally important features are going to be a priori, but there are certain things that through both extant literature and large-scale sequence analysis we think might be important. And so I’m not going to talk so much about this actual prediction side of things today. This is stuff we’re working on. We’re going to talk a bit more about the practical considerations of how we get to be able to make these predictions.
Alex Holehouse (00:08:57):
As a final note, we build computational tools so that you don’t have to. And what I mean by that is our goal in building these tools is not necessarily to build things that are just for us. We build things using industry standards in the context of software engineering. We use continuous integration, version control, documentation. And so our goal is to position other people to easily take a sequence that they’re interested in, pass it into some subset of our tooling, and then use that to make advances in their own questions and their own challenges.
Alex Holehouse (00:09:26):
And so for everything–bar the last section I’m going to talk about today–this is all either pre-printed or published and there’s documentation, there is other information here. And you can just go to our lab website under the tools heading. And all of the relevant things that you might want to get will be linked from there. So I’m not going to put individual links up as we go through.
Alex Holehouse (00:09:45):
So with that introduction in mind, I’m going to talk about four short vignettes today reflecting distinct aspects of the work that we’ve been doing. The way I’m going to do this is I’m going to talk about each one of these. I’m going to identify the problem we’re trying to solve, how we solved it. I’m going to talk a little bit about what we did and how we did it. And then I’m going to stop and take questions on that particular tooling. And of course you should feel free to ask questions on anything throughout the talk, but that way we can address things that come up with specific parts of the content in situ as opposed to waiting until the end.
Alex Holehouse (00:10:19):
So the first problem that we faced when I started my group was we wanted to be able to quickly, accurately, and effectively identify disordered regions. And in many ways, this is a solved problem. Disorder predictors are one of the earliest things that happened in the disordered protein world. We know there’s obviously this relationship between disorder and condensate formation. I’m going to talk a bit about that in a minute. But while there’s been a ton of work in this space and a lot of people have done a lot of fantastic things, one thing we noticed when we started is that many of the top-performing predictors had two problems. And one of them was that they were just very difficult to install. You had to download hundreds of gigabytes of sequence databases for cross-referencing. They required compilation of a variety of different bits of software. They weren’t particularly portable. So that was one problem. And the second problem was many of the highest-performing tools were slow. They were taking minutes to hours per sequence in some cases.
Alex Holehouse (00:11:13):
And so the questions that we wanted to ask as a lab were really focused on these sort of large-scale proteome based analyses. And if you’re taking days, weeks, months, years to predict proteome scale disorder, that’s not really compatible with the workflow that we wanted to do. And so we set out to ask, well, can we build something that’s going to be easily usable across different systems–across Linux, Mac, Windows, that’s going to be easy to install, that will be lightweight, won’t require a whole bunch of things to download, and will work quickly and efficiently and accurately.
Alex Holehouse (00:11:42):
And so to address this problem, we built this predictor Metapredict. This is work that was done by Ryan when he was a graduate student and now a postdoc. I’m going to present today our refined version two of this. We have an initial version that’s detailed in this paper here. And then we have this kind of ever existing as a preprint paper right here talking about this updated version that I’m going to talk about now.
Alex Holehouse (00:12:05):
So how does Metapredict work? So Metapredict basically takes advantage of two related but distinct inputs. So we use two bits of information for predicting disorder using Metapredict. We use consensus disorder. So consensus disorder is just taking a whole bunch of existing disorder predictors and asking what fraction of them or how many of them predict a given residue to be disordered. So this set of data is asking, do we think a residue or a region is going to be disordered? Yes or no?
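As a rough illustration of the consensus idea described here, the sketch below computes, for each residue, the fraction of predictors that call it disordered. The predictor names and their binary calls are invented for the example, and this is not Metapredict's own code.

```python
from typing import Dict, List


def consensus_disorder(calls: Dict[str, List[int]]) -> List[float]:
    """Per-residue fraction of predictors calling a residue disordered.

    `calls` maps a predictor name to a list of 0/1 calls (1 = disordered),
    one entry per residue. All lists must have the same length.
    """
    tracks = list(calls.values())
    if not tracks:
        raise ValueError("need at least one predictor")
    length = len(tracks[0])
    if any(len(track) != length for track in tracks):
        raise ValueError("all predictors must cover the same residues")
    return [sum(track[i] for track in tracks) / len(tracks) for i in range(length)]


# Hypothetical calls from three made-up predictors for a 6-residue sequence.
calls = {
    "predictor_a": [1, 1, 0, 0, 0, 1],
    "predictor_b": [1, 0, 0, 0, 1, 1],
    "predictor_c": [1, 1, 0, 0, 0, 1],
}
scores = consensus_disorder(calls)
print(scores)
```

A residue that every predictor flags scores 1.0, and one that none flags scores 0.0, which matches the "what fraction of them" framing in the talk.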
Alex Holehouse (00:12:31):
And then we also integrate data trained on AlphaFold protein structures. This is a separate neural network that we’ve built that, in principle, predicts how likely we think AlphaFold2 is going to be at predicting if that region has a structure, how confident it’s going to be. And these things are related, but they are not the same thing. This one here is asking can we predict structure? This one here is asking do we think it’s disordered?
Alex Holehouse (00:12:54):
And so what we do is we combine these two things with the idea of saying, well, they tell us distinct but related things. We combine them into a single network with a pair of hidden layers here and then we train that to create a single predictor that takes in sequence and has information from both of these different sources in it. And so the consequence of this then is we have something that, in principle, should be good at predicting both structured regions and disordered regions as well.
Alex Holehouse (00:13:18):
Metapredict gives you two types of information. So you get the per-residue disorder score, which is common to most disorder predictors, and you also get information about contiguous disordered regions or IDRs. And so this becomes useful if you want to break your protein down into folded and disordered bits. You don’t have to do that, you get that out for free basically.
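To make the link between the two outputs concrete, here is a minimal sketch of turning a per-residue score into contiguous regions by thresholding. The threshold, the minimum length, and the scores are all assumptions chosen for illustration. Metapredict's own procedure for delineating IDRs is not described in the talk and may well differ.

```python
from typing import List, Tuple


def contiguous_regions(scores: List[float], threshold: float = 0.5,
                       min_length: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index pairs where scores stay >= threshold.

    Runs shorter than `min_length` residues are discarded.
    """
    regions = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i                      # a new run begins here
        elif score < threshold and start is not None:
            if i - start >= min_length:    # keep the run only if long enough
                regions.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(scores) - start >= min_length:
        regions.append((start, len(scores)))
    return regions


# Hypothetical per-residue disorder scores for a 10-residue sequence.
scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.1, 0.6, 0.9, 0.95, 0.9]
idrs = contiguous_regions(scores, threshold=0.5, min_length=3)
print(idrs)
```

Everything outside the returned regions would be treated as the folded part of the protein in this simplified picture.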
Alex Holehouse (00:13:36):
So when we compare the accuracy of Metapredict against data where we know the assignment of a given residue as either structured or disordered, we do reasonably well. This is just ranking based on accuracy data from the original CAID competition that happened, I guess published earlier this year I think. And so we’re sort of in this top half here. I think it’s probably worth saying that most of the things up here are pretty good. We’re talking about out of a hundred residues getting between one and two residues wrong in this region here. So yes, we’re up here, but I think broadly defined anything in this top third is pretty good.
Alex Holehouse (00:14:14):
This is all looking at natural proteins. One thing we were worried about when we were building this is how well will we fare against unnatural proteins, synthetic proteins? And so we can do that by taking advantage of synthetic proteins designed by David Baker. These are just proteins that were developed by a deep hallucination network. We predict the folded bits to be folded and we predict the disordered His tags to be disordered, which is sort of reassuring. So this is not an annotation, this is a prediction and we kind of get this boundary between the initiating methionine residue and the His tag correct to the residue in both cases where it’s present. And then there’s not one in this protein and so we don’t see it there.
Alex Holehouse (00:14:45):
We can also take completely randomly generated sequences. This is this really nice paper from a couple years ago, basically building randomly shuffled polypeptides and synthesizing them and asking them do they fold or do they not fold? And if we do this, we get most of the things that are disordered by circular dichroism correctly, and we get all of the things that are ordered by circular dichroism correctly as well. So this gives us confidence that we’re not over training on natural proteins. And actually this should be relatively effective for completely synthetic polypeptides as well.
Alex Holehouse (00:15:16):
Metapredict is easy to use and widely available. So if you’re a Python person, you can install it using pip, it’s available as a Python module. It is this easy to predict disorder. It’s available as a command-line tool or a collection of command-line tools if passing in FASTA files. If that’s not your jam, we also have web servers. So we have a general web server that you can go to and you can paste your amino acid sequence into a box and you’ll get a nice graphical representation of where disorder is as well as the actual disordered regions.
Alex Holehouse (00:15:41):
This works for single sequences. If you have many sequences, we also have a Google Colab notebook where you can upload a FASTA file and predict tens, hundreds, thousands, tens of thousands of sequences online in your browser. And so this is sort of a nice option if you want to do large-scale protein disorder prediction but don’t want to have to deal with the resources locally.
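The batch workflow described here (read a FASTA file, then score every sequence) can be sketched in plain Python as follows. The parser is a minimal one written for this example, and `placeholder_predictor` is a stand-in that returns a constant score. In real use it would be replaced by an actual disorder predictor such as the Metapredict Python module.

```python
from typing import Callable, Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)    # sequences may span several lines
    if header is not None:
        records[header] = "".join(chunks)
    return records


def placeholder_predictor(sequence: str) -> List[float]:
    """Stand-in for a real predictor: returns 0.5 for every residue."""
    return [0.5] * len(sequence)


def predict_all(records: Dict[str, str],
                predictor: Callable[[str], List[float]]) -> Dict[str, List[float]]:
    """Apply a per-sequence predictor to every record."""
    return {header: predictor(seq) for header, seq in records.items()}


# Hypothetical two-record FASTA input.
fasta_text = """>demo_protein_1 hypothetical example
MSSPTRRKK
AAGGSYSY
>demo_protein_2
GGSGGS
"""
records = read_fasta(fasta_text.splitlines())
results = predict_all(records, placeholder_predictor)
print({header: len(track) for header, track in results.items()})
```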
Alex Holehouse (00:16:01):
The final thing to say is that Metapredict is pretty fast. So this is comparing execution times across that same set of predictors I showed before. Metapredict here is labeled in blue, and this is a log scale. So the state-of-the-art, the very best predictor takes about a month to process these 600 sequences, and Metapredict takes about a minute to do it. So that’s quite a big difference. And this is really something that we wanted to get because the workflow that we tend to operate with is interactive. We will be working with sequences, we’ll pass them through. And so having to pre-compute these things and wait a long time to pre-compute really gets in the way of an agile approach to analysis.
Alex Holehouse (00:16:34):
What does this mean in terms of real numbers? If we take the human proteome, it’s about 23,000 proteins. Using our Google Colab notebook, so this is using Google’s resources, not your resources, it takes about 45 minutes to predict all of the IDRs in the human proteome. If we do this locally, this is just on my laptop for example, we don’t need any fancy hardware, no GPUs, no particular instruction sets. This takes about 13 minutes to do the human proteome. And so this sort of scale then I think opens up things that might previously have been difficult to do with an accuracy that it would’ve been challenging to achieve using existing tools, at least when we started.
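For a sense of scale, the approximate figures quoted in the talk can be turned into throughput numbers. The only assumption added here is that "about a month" means 30 days.

```python
# Figures quoted in the talk (all approximate).
HUMAN_PROTEOME_SIZE = 23_000         # proteins
LOCAL_MINUTES = 13                   # laptop run
COLAB_MINUTES = 45                   # Google Colab run
SLOWEST_MINUTES = 30 * 24 * 60       # "about a month", assuming a 30-day month
METAPREDICT_MINUTES = 1              # "about a minute" for the same 600 sequences

local_rate = HUMAN_PROTEOME_SIZE / (LOCAL_MINUTES * 60)   # proteins per second
colab_rate = HUMAN_PROTEOME_SIZE / (COLAB_MINUTES * 60)   # proteins per second
speedup = SLOWEST_MINUTES / METAPREDICT_MINUTES           # on the 600-sequence benchmark

print(f"local: {local_rate:.1f} proteins/s")
print(f"colab: {colab_rate:.1f} proteins/s")
print(f"benchmark speedup: {speedup:,.0f}x")
```

That works out to roughly 29 proteins per second on a laptop, about 8.5 per second on Colab, and a gap of more than four orders of magnitude on the 600-sequence benchmark.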
Alex Holehouse (00:17:09):
I’m going to just briefly stop there and ask if people have questions about Metapredict specifically. And if not, then I’m going to jump onto the next little bit. And I think people can unmute themselves if they have questions to ask.
Jill Bouchard (00:17:20):
I don’t see anything in the chat yet.
Alex Holehouse (00:17:27):
All right, that’s cool. That’s fine.
Sangram Parelkar (00:17:29):
Yeah, I do have a question.
Jill Bouchard (00:17:30):
Sorry, we’ve got one at the table though. Go ahead.
Alex Holehouse (00:17:32):
Sangram Parelkar (00:17:33):
I was curious about epigenome. For example, there could be epigenetic manipulation of a particular protein that might affect the IDR. So when doing risk predictions, do you take that into account? Possibly cell type, like cancerous/non-cancerous, or where the proteome could have an effect?
Alex Holehouse (00:17:52):
Yeah, so what we are doing, so the question that this is answering is, given the polypeptide sequence passed in, do we think it’s going to be disordered or not? So you can imagine there being effects of, for example, post-translational modifications, effects of other proteins that are there, effects of changes in the solvent environment for example, depending on cell type, metabolomic profiles, things like that. We don’t take those things into account. In part, because we don’t really have any way to know what that’s going to be. We haven’t actually done it with phosphorylation. We generally don’t think … There are some really elegant exceptions. Alaji Bah when he was in Julie Forman-Kay’s group has this incredible paper showing phosphorylation driving folding of one of the eukaryotic initiation factors.
Alex Holehouse (00:18:35):
But in general, we don’t think about post-translational modifications dramatically altering disorder or lack thereof. Although I suppose as I’m saying that I suppose I don’t have great examples of why I think that’s true. But in general, the residues that get modified are happy being in disordered regions prior to modification, and post-modification are also happy being in disordered regions. So it’s not typically that these dramatically alter the physical chemistry in a way that shifts disorder. They alter the physical chemistry, but not in a way that necessarily makes them more or less prone to be disordered. I don’t know if that fully answers your question. I think the broader gamut of cellular changes that could influence things is a really interesting question, but it’s not something that we really have any way to know about in this context particularly.
Sangram Parelkar (00:19:25):
Sounds good. Thank you, Alex.
Alex Holehouse (00:19:26):
Okay, so with that in mind, I’m going to move on to two other shorter vignettes talking about some of the other tools we’ve been developing. And so one of the big problems that we faced actually as a lab for a few years now is this realization that there’s actually many different types of data available. And what I mean by this is that there are gigantic, very expensive, very effective high-throughput experiments, large computational studies that have annotated proteomes with a wide variety of different things. So things like folded domains, things like crosslinking through mass spec, post-translational modifications, mutations.
Alex Holehouse (00:20:04):
Sometimes we analyze those things in isolation, but often we’d like to be able to ask how these things relate to one another. So for example, how does phosphorylation relate to disorder? And doing this actually becomes logistically quite tricky if you have large data sets that are complex, that are heterogeneous, that are different types of data. So you can imagine post-translational modifications map to specific positions on the sequence, whereas domains map to a region on a sequence.
Alex Holehouse (00:20:28):
And so this becomes logistically difficult to do at scale if you’re trying to integrate large data sets that have multiple different types of annotations. And so to address this, we developed this Python framework called SHEPHARD. And SHEPHARD really gives you a way to wrangle large proteome-scale data sets (although, obviously, it can be used for smaller data sets) in a way that makes doing integrative analysis easy, straightforward, and very quick to do.
Alex Holehouse (00:20:53):
So to keep this sort of high level and not get too down into the weeds, the way that SHEPHARD works is you can take a polypeptide sequence, you can read it into SHEPHARD in a variety of ways, and it gets represented as a protein object. And so proteins can be annotated with different types of annotations, they can have domains, they can have sites, they can have tracks and they can have attributes. And domains could be things like disordered regions or folded regions. Sites could be things like mutations. Tracks could be per-residue hydrophobicity. Attributes could be copy number or functional annotations, for example.
Alex Holehouse (00:21:21):
And so this would be the general structure you’d expect for a single protein. Of course, we’re interested in large collections of proteins. So you can read in many proteins into what we call a proteome. Each proteome has one or more proteins, then each protein has one or more domains, tracks, sites, or attributes. And the key thing is that these all know about one another. And so you can ask questions at whatever level of resolution you want to ask in a way that’s relatively easy to do.
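The hierarchy described above can be sketched with a few small classes. All class, method, and field names below are hypothetical and chosen only to mirror the talk (proteome, protein, domains, sites, tracks, attributes). They are not SHEPHARD's actual API.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass(eq=False)
class Domain:
    start: int                               # 1-indexed, inclusive
    end: int                                 # 1-indexed, inclusive
    domain_type: str
    protein: "Protein" = field(repr=False)   # link back to the parent protein

    @property
    def sequence(self) -> str:
        return self.protein.sequence[self.start - 1:self.end]


@dataclass(eq=False)
class Site:
    position: int                            # 1-indexed
    site_type: str
    protein: "Protein" = field(repr=False)   # link back to the parent protein

    @property
    def residue(self) -> str:
        return self.protein.sequence[self.position - 1]


@dataclass(eq=False)
class Protein:
    unique_id: str
    sequence: str
    domains: List[Domain] = field(default_factory=list)
    sites: List[Site] = field(default_factory=list)
    tracks: Dict[str, List[float]] = field(default_factory=dict)   # per-residue values
    attributes: Dict[str, str] = field(default_factory=dict)       # per-protein values

    def add_domain(self, start: int, end: int, domain_type: str) -> Domain:
        domain = Domain(start, end, domain_type, self)
        self.domains.append(domain)
        return domain

    def add_site(self, position: int, site_type: str) -> Site:
        site = Site(position, site_type, self)
        self.sites.append(site)
        return site

    def sites_in(self, domain: Domain) -> List[Site]:
        """Sites whose position falls inside the given domain."""
        return [s for s in self.sites if domain.start <= s.position <= domain.end]


@dataclass(eq=False)
class Proteome:
    proteins: Dict[str, Protein] = field(default_factory=dict)

    def add_protein(self, unique_id: str, sequence: str) -> Protein:
        protein = Protein(unique_id, sequence)
        self.proteins[unique_id] = protein
        return protein

    def __iter__(self) -> Iterator[Protein]:
        return iter(self.proteins.values())


# Toy data: one made-up protein with one domain, one site, a track and an attribute.
proteome = Proteome()
protein = proteome.add_protein("demo1", "MSSPTRRKKAAGGSYSY")
idr = protein.add_domain(2, 9, "IDR")
site = protein.add_site(5, "phosphosite")
protein.tracks["toy_track"] = [0.0] * len(protein.sequence)
protein.attributes["note"] = "toy example"

# A cross-annotation question: which sites fall inside which domains?
sites_in_idrs = [
    (p.unique_id, d.domain_type, s.residue)
    for p in proteome
    for d in p.domains
    for s in p.sites_in(d)
]
print(sites_in_idrs)
```

Because each domain and site keeps a reference to its protein, a question such as "which sites sit inside IDRs" becomes a short loop rather than a manual coordinate join across files.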
Alex Holehouse (00:21:45):
Where do these annotations come from? They can really come from anywhere. We’ve built this interface that lets you, in a very simple, well-defined file format that you can build using Microsoft Excel, if that’s your preferred weapon of choice, create annotation files that can then be read into and annotated onto these proteomes. And so our goal is to make it as easy as possible for people to read information into SHEPHARD, but also the annotation files themselves can be opened in Excel or in a text editor and you can look at them. And so our goal is to create something that is both robust but also accessible even to people who are not coming from a computational background and maybe don’t have a real desire to think about things like JSON formats, for example, or more complex, perhaps more robust, but more complex formats that are typically used within the computer science world.
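As an illustration of why a flat, spreadsheet-friendly file is convenient, the sketch below reads a tab-separated domain annotation file with Python's standard `csv` module. The column layout (identifier, start, end, domain type) and the identifiers are assumptions made for this example. SHEPHARD's documentation defines the real file specification.

```python
import csv
import io
from typing import List, NamedTuple, TextIO


class DomainRecord(NamedTuple):
    unique_id: str
    start: int
    end: int
    domain_type: str


def read_domain_annotations(handle: TextIO) -> List[DomainRecord]:
    """Read tab-separated rows of: unique_id, start, end, domain_type."""
    records = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue                         # skip blank lines and comments
        unique_id, start, end, domain_type = row[:4]
        records.append(DomainRecord(unique_id, int(start), int(end), domain_type))
    return records


# Hypothetical file contents, as they might be exported from a spreadsheet.
example = (
    "# unique_id\tstart\tend\tdomain_type\n"
    "P00001\t1\t40\tIDR\n"
    "P00001\t41\t120\tfolded\n"
    "P00002\t10\t55\tIDR\n"
)
records = read_domain_annotations(io.StringIO(example))
print(records)
```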
Alex Holehouse (00:22:31):
The real advantage of all of this is that it puts you in a position where you can then analyze at proteome scale using something that looks basically like native Python. So you can use syntax that feels like you’re just iterating through the proteins, or through the domains in a protein. And you can do things that are very, very simple. And so in this sort of very, very simple syntax, you can use kind of Pythonic ways to interact with the data structures that make it easy to write readable code and readable analysis code that can be shared with many different people.
Alex Holehouse (00:23:00):
And so just as an example, I want to quickly share a couple of things that we’ve done with this. So we’re interested in this idea of there being a relationship between the sequence chemistry and the molecular function of IDRs. Obviously, in folded proteins we’re familiar with this idea of there being structure and function, but maybe there’s something about the chemical properties. And various people, Alan Moses for example, have done really elegant work proposing that this may be something that is true.
Alex Holehouse (00:23:22):
So we just parse through the human proteome. We were interested in different charge-rich proteins. You can see that actually if you divide IDRs up into proteins with IDRs that are either arginine-enriched or lysine-enriched, there’s this really striking separation of function where the proteins with arginine-enriched IDRs tend to be enriched for RNA-binding function, whereas the proteins with lysine-enriched IDRs tend to be enriched for DNA binding. And this sort of division of labor here is, I think, fairly strong. The overlap is fairly low and sort of implies that these two types of residues have quite distinct flavors of chemistry that they prefer to interact with.
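A composition-based split like the one described here can be sketched in a few lines. The thresholds below (`min_fraction`, `min_ratio`) and the example sequences are invented for illustration. The talk does not state the criteria actually used to define arginine-enriched or lysine-enriched IDRs.

```python
def residue_fraction(sequence: str, residues: str) -> float:
    """Fraction of positions in `sequence` occupied by any residue in `residues`."""
    if not sequence:
        return 0.0
    return sum(sequence.count(r) for r in residues) / len(sequence)


def classify_basic_idr(sequence: str, min_fraction: float = 0.15,
                       min_ratio: float = 2.0) -> str:
    """Label an IDR as arginine-enriched, lysine-enriched, or neither.

    Both cutoffs are hypothetical: a residue must make up at least
    `min_fraction` of the sequence and outnumber the other basic
    residue by at least `min_ratio`.
    """
    frac_r = residue_fraction(sequence, "R")
    frac_k = residue_fraction(sequence, "K")
    if frac_r >= min_fraction and frac_r >= min_ratio * frac_k:
        return "arginine-enriched"
    if frac_k >= min_fraction and frac_k >= min_ratio * frac_r:
        return "lysine-enriched"
    return "neither"


# Made-up example sequences.
labels = {
    seq: classify_basic_idr(seq)
    for seq in ("GRGGRGSRGGRGYRGG", "KKSPAKKAEKKPAAKK", "MSTNPQAGSL")
}
print(labels)
```

The same `residue_fraction` helper, called with `"QGSTN"`, would give the polar fraction relevant to polar-rich low complexity domains.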
Alex Holehouse (00:23:59):
We can do something similar if we look at polar-rich low complexity domains. So these are regions that are enriched only for glutamine, glycine, serine, threonine, and asparagine. And if we do that and then ask for those polar-rich low complexity domains that are also enriched for only one other type of chemistry (aromatic residues, charged residues, or aliphatic residues), we again find there’s an almost non-overlapping set of proteins with distinct functional annotations.
Alex Holehouse (00:24:24):
So for example, those aromatic-rich polar LCDs, which many people have studied before, are often involved in RNA binding but also subcellular organization. Whereas when we look at aliphatic-rich polar low complexity domains, we find almost exclusively things involved in transcription regulation, either transcription factors or chromatin modification, but not typically RNA-binding proteins, for example. And so this again implies that there is an interpretable molecular chemistry or molecular grammar that relates IDR sequence to IDR function.
Alex Holehouse (00:24:54):
SHEPHARD again is easy to use and widely available. You can install it using pip. This is just showing in Python code what it would look like to read in the human proteome and then annotate with a set of IDRs. So you can very quickly go from nothing to having a large proteome object to interact with. We’ve built this to be as easy as possible to quickly start doing large-scale analysis.
Alex Holehouse (00:25:13):
With that in mind, we also provide a pre-compiled and pre-annotated version of the human proteome as a Google Colab notebook. So if this is something that’s of interest, you can go to this Colab notebook now and you can start doing proteome-wide bioinformatic analysis in your browser without needing to download or install anything. This is all pre-computed, takes maybe 60 seconds to install and annotate everything on Google’s end. And then you can actually start writing code to do these large-scale integrative analyses that prior to this would’ve taken us quite a long time and now can be done in a few lines of code in the browser.
Alex Holehouse (00:25:47):
So with that, I’m going to briefly again pause and just see if people have questions about this SHEPHARD tool. There’s a lot of things that I could talk about with this and maybe a more extended discussion might be better until the end. But if there are burning questions right now, I’m happy to take some.
Jill Bouchard (00:26:03):
Burning questions? Guess not. Take the reins, Alex.
Alex Holehouse (00:26:09):
So the third thing I want to briefly touch on–and then we will get to condensates, I promise–is this question of if you have a large data set either experimentally defined or computationally defined, or you have many, many sequences that match to some annotation. So in an assay that could be fluorescence, that could be viability, that could be expression. Sometimes we’d like to be able to understand, what about the sequences gave rise to that particular readable or interpretable phenotype? And sometimes this is really obvious, but often it’s not, especially with disordered proteins. It is not necessarily clear why certain sequences give rise to certain phenotypic consequences.
Alex Holehouse (00:26:46):
And so to address this, we built this tool PARROT. And PARROT is a deliberately simple way to use deep learning to map between amino acid sequence and arbitrary protein annotation, either at the per-protein level or the per-residue level. And so as a quick bit of background, we really built PARROT to make it easy for people who are not from the machine learning or deep learning world to start doing these types of analyses.
Alex Holehouse (00:27:12):
So I think it is worth saying PARROT is not doing anything particularly … If you’re an experienced deep learning person, there’s nothing PARROT’s doing necessarily that you couldn’t do now. I think the big difference is it makes doing this really easy. It makes doing it easy in a way that sort of hides a lot of the complexities that you need to think about in terms of train/test/validation splitting and various other things behind the scenes, so that things will just by default work out of the box.
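For readers new to the idea, this is roughly the kind of bookkeeping that gets hidden. The sketch below splits a data set into training, validation, and test subsets in plain Python; the function name, the 70/15/15 fractions, and the fixed seed are assumptions for illustration and are not taken from PARROT.

```python
import random


def split_dataset(records, fractions=(0.7, 0.15, 0.15), seed=0):
    """Shuffle records and split them into (train, validation, test).

    The fractions are illustrative defaults and must sum to 1.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_train = round(fractions[0] * len(shuffled))
    n_val = round(fractions[1] * len(shuffled))
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, validation, test


train, validation, test = split_dataset(range(100))
```

The model is fitted on the training set, tuned against the validation set, and only scored on the test set at the end, so the reported accuracy reflects sequences it has never seen.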
Alex Holehouse (00:27:40):
And we provide, I should say, on our documentation, we provide a set of resources aimed specifically at people who are not coming from a machine learning or deep learning background to help provide some context as to what we’re trying to solve and the kinds of problems you can face. So at a very high level, PARROT uses this long short-term memory bidirectional recurrent neural network architecture, which is a type of deep learning model that’s been used historically in natural language processing. It has some nice features from a computational standpoint. So the memory footprint’s relatively low and it works reasonably well on both CPUs and GPUs.
Alex Holehouse (00:28:13):
You can of course do everything I’m going to describe using GPUs and it’ll be much faster, but we often use just CPUs for things and it’s very usable. So that’s not strictly a requirement. The general workflow that we suggest for PARROT is relatively straightforward. So the input file you have looks like this. It’s just three columns separated by a space. The first column is the sequence identifier, the second column is the actual sequence, and the third column is some annotation associated with that protein sequence.
Alex Holehouse (00:28:40):
And this is what the file would look like if you were doing a sequence to one annotation mapping. If this was a sequence to every residue, then you would just have as many numbers after the sequence as residues in the sequence. And each of those would map back to one value per residue. You can have large sets of sequences. I’m currently on my work computer training a model with half a million different sequences, but you can use much smaller data sets as well.
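The two file layouts described here can be sketched as a small parser. This is a plain-Python illustration of the whitespace-separated format from the talk (identifier, sequence, then one value per sequence or one value per residue); the function names and the example records are made up, and this is not PARROT's own reader.

```python
def parse_record(line):
    """Split one whitespace-separated line into (id, sequence, values)."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("expected an identifier, a sequence, and at least one value")
    seq_id, sequence = fields[0], fields[1]
    values = [float(v) for v in fields[2:]]
    return seq_id, sequence, values


def record_is_consistent(sequence, values, per_residue):
    """Check the value count: one per residue, or a single value per sequence."""
    expected = len(sequence) if per_residue else 1
    return len(values) == expected


# Invented example lines: one per-sequence record, one per-residue record.
per_sequence_line = "seq1 MKGSSRGE 0.42"
per_residue_line = "seq2 MKGS 0.1 0.9 0.3 0.0"

id1, s1, v1 = parse_record(per_sequence_line)
id2, s2, v2 = parse_record(per_residue_line)
```

A quick consistency check like `record_is_consistent` is worth running before training, because a per-residue record with a missing value is easy to create by accident.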
Alex Holehouse (00:29:07):
And you take this data set and you feed it into our command-line tool parrot-train. And this just basically involves passing in this file, giving it a name for the output network, and telling it that this is a sequence problem. That means there’s going to be a single class because we’re doing regression and now we’re training a network. And that’s really all there is to it. And this will start training. Epoch is basically how many times you’ve run through the data. It will print out the loss, so how well it’s doing, how well it’s performing. And then once it’s finished it will print out this information and give you a bunch of other output in terms of how well things have worked in comparison to your test and validation data sets as well.
Alex Holehouse (00:29:43):
The output from this is this model file, and this model file is just a data file that has a bunch of numbers in it that tells the network how to weight different nodes or different components of this underlying network. And then once you have that model file, you can take a set of unknown things. So this is almost exactly the same format as the input file I just showed you except there’s now no annotations. So we just have unknown sequence indices and then the sequences, but we don’t know what the values of these sequences should be.
Alex Holehouse (00:30:10):
And you can take that model file I just described, you can pass it into parrot-predict, you can give it the unknown sequences, the model file and you can give an output and you run this. And basically from doing that, you get out a file that looks like this where now you have these predicted annotations for each of these sequences as well. And so the nice thing about this is that in a couple of commands you can go and train a relatively effective deep learning model to predict arbitrary things. And so we’ve used this for a wide variety of problems and a wide variety of systems and it works reasonably well. There are certainly types of problems where it maybe will struggle more than others, but by and large, if the data set is large enough, you can generate very accurate models for predicting things. And actually, using PARROT is how we built Metapredict originally as well.
Alex Holehouse (00:31:01):
Just as an example of what these models actually do and how well or poorly they perform, what I’m showing you here is an example from the paper, where Dan went and trained a model using existing phosphosite data and then compared it to a set of different state-of-the-art phosphosite predictors. And so this MCC score, this Matthews correlation coefficient, basically the higher the value, the better the predictor has done. You can see here PARROT in blue has generally equaled or outperformed the state-of-the-art phosphosite predictors, certainly for serine and threonine, maybe a little bit worse for tyrosine. Perhaps that’s just a function of not having as much data.
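For reference, the Matthews correlation coefficient is computed from the four cells of a binary confusion matrix, and ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction right). The sketch below is a plain-Python version of the standard formula; returning 0 when the denominator vanishes is a common convention and is an assumption here, and the counts are invented.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    tp/tn: true positives/negatives, fp/fn: false positives/negatives.
    Returns 0.0 when any marginal total is zero (a common convention).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Invented counts for a hypothetical phosphosite predictor.
score = matthews_corrcoef(tp=40, tn=900, fp=20, fn=40)
```

Unlike plain accuracy, the MCC stays informative when positives are rare, which is the usual situation for phosphosites.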
Alex Holehouse (00:31:35):
But the reason to show this is not to say we’ve made an amazing phosphosite predictor. I’m sure there are many things we could do to make it better. And I’m sure there are things that these tools could do that would also potentially render them more accurate as well. I think the bigger reality is that these tools–PHOSFER, MusiteDeep, PhosphoSVM–are tools that people have built for a specific purpose, whereas we’re taking PARROT, which is a completely general purpose tool, and able to create something in an afternoon that equals those performances as well.
Alex Holehouse (00:32:03):
And so from our perspective, this has really opened the door for asking types of questions that we previously didn’t really think we could answer. And especially one thing that we take advantage of is if you can generate data from other slow predictors, you can train a PARROT predictor on those slow predictors and you can create a fast predictor. And so we use this fairly extensively in the context of building these kind of high-throughput tools that I’ve discussed earlier.
Alex Holehouse (00:32:28):
With that, I’ll just stop here before I get into the condensate stuff and ask if people have questions about PARROT or about this kind of deep learning stuff. Again, this is all published and is online and if you have issues you can raise issues on GitHub as well.
Alex Holehouse (00:32:48):
All right, well, having had a sip of my coffee, the last thing that I’ll talk about is unpublished stuff. And this is unpublished stuff that’s based on simulations in the context of looking at condensates and disordered proteins. And so the problem we’re trying to face here is we’d like to understand how IDR sequences influence both conformational behavior and phase behavior. And we’d like to do that in a way that’s easy and high-throughput.
Alex Holehouse (00:33:12):
So graphically the sequence-ensemble relationship is something that we can do in a few different ways, both experimentally and then computationally. Similarly with a sequence to phase behavior, we can do this in a few different ways. And we’d like to find a way to do this where it is sufficiently quick but sufficiently accurate, that it becomes easy to play around with. And I think this is one of the big things that at least from our perspective has been useful is if tools are sufficiently fast and sufficiently easy to use, then the barrier for trying stuff is really low and trying stuff is a really good way to learn and build intuition. And so if you have to spend a whole bunch of time editing complex data files and trying to figure things out, I think that creates a barrier for a lot of people that they’re not going to get over. And so our goal and what I’m about to talk about is building something that is easy for anyone to use regardless. Even if it’s not the most accurate thing you could do in the world, that it’s something that’s easy to get started with.
Alex Holehouse (00:34:04):
And so to address this, we’ve been working for a while now on this simulation engine, PIMMS, which is a general purpose polymer simulation engine, which more recently we’ve built to have a kind of protein-flavored force field. And this has really been work that’s been driven, in different aspects and different components, by almost everyone in the lab. So Ryan, J, Garrett, and Dan were the first four students to join my lab right at the start. And all four of them have worked on different aspects of improving PIMMS, building new tools for PIMMS, building insights around using PIMMS.
Alex Holehouse (00:34:39):
There’s a bit of history though, which I think I do want to take a second to mention, which is I started working on PIMMS myself as a graduate student many, many years ago. So this is young Alex, so naive, so optimistic. And this is back in 2014. Actually, the very, very first paper that did anything with PIMMS, it doesn’t call it PIMMS, but it’s actually Erik’s and my first paper in 2016 where if you are a real fan and you go through the supplementary information, you get to figure S17. Firstly, congratulations. And secondly, you’ll see some mention of a coarse-grained simulation engine.
Alex Holehouse (00:35:12):
These are actually very early simulations done by a simple version of PIMMS. And so we built this. I continued to work on PIMMS during my postdoc. Some things change, some things didn’t change quite so much. And we used it in a variety of different contexts. I think most effectively, again in this paper with Erik that was published with Tanja’s group in 2020, where we used it to look at large-scale phase behavior and build these phase diagrams. I’m going to talk a little bit about that in a second.
Alex Holehouse (00:35:43):
As I started my own group, I’ve continued to work on PIMMS. So I want to highlight this because I think one of the things that Rohit has done for me is sort of let me take that with me into my own group and develop it as our own lab’s software. And I think it is unambiguously true that PIMMS was built very much during my time in the Pappu lab. We’ve rewritten big chunks of it in the last couple years, but a lot of it and a lot of the ideas were seeded there. And I think it could be very easy just to not mention that, but I think that would be really disingenuous. And so I think it’s an important feature of mentorship to be willing to let people take their ideas and run with them. And in this case, their actual literal resources. This is a large code base that we built over a number of years.
Alex Holehouse (00:36:25):
So we are now finally, hopefully gearing up to release the first version, the public version at least. There is a version that’s online, but it’s been rewritten since then for winter 2022. And so this is not quite available yet, but it is very close to being available. Both the software itself and also the protein flavored force field that I’m going to show you in a second.
Alex Holehouse (00:36:48):
So what is PIMMS? So PIMMS is a lattice-based Monte Carlo simulation engine. That’s a bunch of words that may or may not make any sense to you. So lattice-based means that we discretize our three-dimensional space into voxels. These are three-dimensional cubes. And then our proteins or our polymers–I may accidentally call them proteins, but the reality is these are just polymers–occupy a single space in these voxels and polymers are then connected by bonds that hold consecutive monomers in the polymer together essentially like this.
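As a rough picture of what "on a lattice" means in practice, the sketch below stores a polymer as a list of integer voxel coordinates and checks two rules: no two beads share a voxel, and consecutive beads sit in neighboring voxels. Treating "neighboring" as any of the 26 surrounding voxels is an assumption made for this illustration, and none of the names here come from PIMMS.

```python
def chebyshev_distance(a, b):
    """Largest per-axis separation between two voxels, in lattice units."""
    return max(abs(x - y) for x, y in zip(a, b))


def is_valid_chain(chain):
    """True if every bead has its own voxel and bonded beads are adjacent.

    `chain` is a list of (x, y, z) integer tuples, one per monomer.
    Adjacency here means any of the 26 surrounding voxels (an assumption).
    """
    if len(set(chain)) != len(chain):
        return False  # two beads occupy the same voxel
    return all(chebyshev_distance(a, b) == 1 for a, b in zip(chain, chain[1:]))


good_chain = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (2, 2, 1)]
broken_bond = [(0, 0, 0), (2, 0, 0)]                # bonded beads too far apart
overlapping = [(0, 0, 0), (1, 0, 0), (0, 0, 0)]     # voxel used twice
```

A Monte Carlo move then proposes new voxel positions for one or more beads and is only considered if the resulting chain still passes checks like these.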
Alex Holehouse (00:37:16):
There are some reasons technically and conceptually why this is helpful. There are also some things that this does that make it not so great. In the context of flexible polymers, this is fine. In the context of foldable proteins, we are making some assumptions about conformational entropy that maybe are not totally fair. And so I’m happy to talk about those if people want to talk about them. But I won’t bore people with them for now.
Alex Holehouse (00:37:40):
What this schematic is just showing is that we can imagine having two beads that we’re going to focus on: bead A and bead B. And there are three types of non-bonded interactions that PIMMS engages with. There are these short-range interactions, which basically is every voxel around the voxel where the bead is. There’s long-range, which is two away, and then there’s super-long-range, which is three away. And we also have the ability to provide, this is not required, but you can put in a backbone torsional potential, which is a very simple way to think about how bendy the chain is. And so you can make it more energetically expensive to be in this bent conformation, or better to be in this extended conformation to control the persistence length as well.
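The three interaction ranges can be sketched as a lookup on lattice distance. This is a guess at the geometry, not PIMMS code: it assumes "one, two, or three away" means the largest per-axis separation between two voxels, treats all beads as unbonded and of one type, and uses invented energies in arbitrary units.

```python
# Shell names follow the talk; the distance convention is an assumption.
SHELL_BY_DISTANCE = {1: "short-range", 2: "long-range", 3: "super-long-range"}

# Hypothetical pair energies in arbitrary units, for illustration only.
PAIR_ENERGY = {"short-range": -1.0, "long-range": -0.3, "super-long-range": -0.1}


def lattice_distance(a, b):
    """Largest per-axis separation between two voxels."""
    return max(abs(x - y) for x, y in zip(a, b))


def interaction_shell(a, b):
    """Name of the shell two beads interact through, or None if out of range."""
    return SHELL_BY_DISTANCE.get(lattice_distance(a, b))


def nonbonded_energy(beads):
    """Sum pair energies over all bead pairs that fall within some shell."""
    total = 0.0
    for i in range(len(beads)):
        for j in range(i + 1, len(beads)):
            shell = interaction_shell(beads[i], beads[j])
            if shell is not None:
                total += PAIR_ENERGY[shell]
    return total


energy = nonbonded_energy([(0, 0, 0), (1, 0, 0), (3, 0, 0)])
```

In a real force field the energy would also depend on which two bead types are interacting, so `PAIR_ENERGY` would be indexed by bead pair as well as by shell.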
Alex Holehouse (00:38:21):
The picture I showed you before was in 2D because showing pictures in 2D is much easier than 3D. This is just comparing analogous simulations run in 2D and 3D where we’ve set the moves to only allow the chain to grow in one end or another. The reason I’m showing this is because I’m going to show a bunch of other simulation data where it looks like we have nice smooth things moving through continuous space. And I just want to really reiterate that this is actually all on a lattice. We typically write out data at a frequency where you can’t really see that and the analysis doesn’t necessarily make that obvious. But as you can see here, the chains in both 2D and 3D are sort of meandering through space in this very discrete way because they’re moving into different voxels on the lattice as they’re moving. So I’m not showing the lattice cells here, but they are what is keeping the beads at specific positions in 3D space.
Alex Holehouse (00:39:10):
One of the nice things about PIMMS is that it’s pretty good at accommodating large simulations. So what I’m showing here on the left is something I just ran. I realized I didn’t have a good example of this, so I ran it on my computer this morning. This is 200 different chains, 10 copies of each chain. So this is 2000 different polymers. Each polymer is 20 beads long, I guess. And so the point is that you can accommodate many … and by different chains, different types of polypeptides. So this is like 200 different IDRs all in the same box together.
Alex Holehouse (00:39:41):
And so this is a nice feature because sometimes scaling to many different types of sequences can be difficult for simulation engines for a variety of reasons. Whereas it becomes completely simple here. And even if this system looks like it’s complete chaos, if we kind of follow the potential energy of what’s going on here, the system is relaxing. So even in this sort of hairball, it looks like the system is arranging itself into a configuration that is energetically favorable. And so out of this sort of chaos you can find order. And if you run these simulations for long enough, one would I’m sure see this system phase separate into a whole bunch of different droplets depending on which short polymers were recruited into some versus others.
Alex Holehouse (00:40:23):
So PIMMS runs relatively quickly, it runs without any special hardware. The fast bits are written in C and the slow bits are written in Python, but it’s very easy to install. One of the things PIMMS lets us do is we can build phase diagrams. And I want to take a second just to walk through what I mean by that and how this works. Because I think this is actually the most useful illustration of understanding what a phase diagram actually is, which is what I’m showing you here is a simulation where we’re running at some temperature, temperature four specifically. And we’re running it at some total concentration of polymers, which is corresponding to where we are on the x axis. I’m running at this temperature here, which is some high value. And so we’re in the one-phase region. So on your standard phase diagram, we’re in this white region here. And so there’s no droplet, nothing is happening.
Alex Holehouse (00:41:07):
And if we drop that temperature to temperature three, we’re now in the two phase regime. And so what does that actually mean? What it means is that we now have two distinct phases and we can measure the concentration of those two phases in a very mechanical way simply by firstly looking at the concentration of polymers inside our droplets. We can just count. How many polymers do we have in this unit of volume? That gives us a number. The concentration is just number divided by volume. So we know the number, we know the volume, we get a concentration and we can put that here on the dense phase of the binodal. But we can then also do the same thing outside the droplet and we can calculate the concentration around the droplet. And this gives us the dilute phase around the droplet.
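The counting procedure can be sketched in a few lines. This is a simplified illustration rather than the analysis used in the talk: it assumes a single spherical droplet with a known center and radius, ignores periodic boundaries and the fuzzy interface, and uses invented coordinates.

```python
import math


def phase_concentrations(positions, center, radius, box_length):
    """Return (dense, dilute) concentrations as count divided by volume.

    Beads within `radius` of `center` are counted as the dense phase;
    everything else in the cubic box is counted as the dilute phase.
    """
    inside = sum(1 for p in positions if math.dist(p, center) <= radius)
    outside = len(positions) - inside
    droplet_volume = (4.0 / 3.0) * math.pi * radius ** 3
    remaining_volume = box_length ** 3 - droplet_volume
    return inside / droplet_volume, outside / remaining_volume


# Toy configuration: ten beads near the droplet center, two far away.
positions = [(10 + dx, 10, 10) for dx in (-2, -1, 0, 1, 2)]
positions += [(10, 10 + dy, 10) for dy in (-2, -1, 1, 2)]
positions += [(10, 10, 11)]
positions += [(1, 1, 1), (19, 18, 2)]

dense, dilute = phase_concentrations(positions, center=(10, 10, 10),
                                     radius=3.0, box_length=20.0)
```

Repeating this measurement at each temperature gives one dense-arm point and one dilute-arm point per temperature, which together trace out the binodal.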
Alex Holehouse (00:41:48):
And so we can do this at a variety of different temperatures and we can sort of build up a phase diagram simply by measuring the concentrations inside the droplet versus outside the droplet. We don’t need to make any assumptions about the underlying free energy landscape. We don’t need to make any assumptions about numerically or analytically how we’re solving this. We can just run these simulations, see what happens, measure the concentrations almost like a computational experiment, and then use this to build these phase diagrams directly. And I sort of like this as an approach because, by doing this, we’re not assuming we understand anything about the system. We’re just running it. We’re allowing physical chemistry to do its thing and then we’re measuring the concentration inside or outside the droplet. So this scales quite nicely to multiple different components as well.
Alex Holehouse (00:42:33):
Beyond that though, we can ask more complex questions. So what I’m going to show you here is a very simple model of questions in the context of transcription factor binding. And so I’m going to walk through what we’re looking at and then we’re going to talk a bit about what we actually see. So in this very, very simple model we have a few different components. We have chromatin, which is this long blue polymer here. I’m putting this very much in parentheses because anyone who actually works in the chromatin field is probably smashing their head into their desk. This is not really chromatin, it’s just a blue polymer, but it doesn’t interact with itself.
Alex Holehouse (00:43:02):
And on that chromatin we have these binding sites for transcription factors. These are red beads here, which interact favorably with our transcription factor, which is this white thing here. And then we also have a set of sites around one of these transcription factor binding sites in yellow here. And these sites here are going to be binding sites for an accessory factor. And the accessory factor is some other component that we don’t yet have in our system.
Alex Holehouse (00:43:25):
And so what we’re looking at here first is just a system that has the blue, the red, and the yellow. And if we run this as a simulation, what you can kind of see is this transcription factor, the white bead here, is jumping between these different red binding sites. And if we run this simulation for long enough and we ask which of these sites does it spend most time at, or which of the sites across the simulation is it bound at for the highest fraction? You find that every site is approximately equally visited. There’s some noisiness if you wait long enough and you’re willing to run enough replicas that you will get uniform distribution of these binding sites across the system.
Alex Holehouse (00:43:59):
So this is a very, very simple system. This is a simple model. This is asking just a question of where does the transcription factor go? We can then add in another component, we can add in the accessory factor. And the accessory factor interacts with the transcription factor very weakly. There’s a tiny interaction that would not be measurable using any sort of normal way. It interacts with itself quite strongly. And it also interacts with these accessory factor binding sites as well.
Alex Holehouse (00:44:24):
And so we set this system up again, and we’ve added one additional component. What this then does is initially the accessory factor condenses on this binding site for the accessory factor, it condenses on itself and it goes to one specific place. But then if you follow this transcription factor, because there’s this weak interaction of the transcription factor with the accessory factor, that transcription factor now spends almost all of its time bound at one specific site on our chromatin, which is this one here that’s adjacent to the accessory sites.
Alex Holehouse (00:44:51):
And so the reason I’m showing this very contrived example and very simple example is that what we functionally have here is a scenario where you could imagine adding a second component and determining the specificity of this transcription factor for one of many possible equivalently good sites. And we’re doing that just by an additional set of interactions becoming accessible through this accessory factor. And I think this is sort of interesting conceptually because we often think about specificity. My lab thinks a lot about how transcription factors find their binding sites across the genome. And we often think about specificity as being this binary recognition problem. But what this very, very simple example is saying is that if our readout for specificity is where are you binding, there are many other things that can determine that specificity. They don’t have to actually relate to how tight the simple interaction between the transcription factor and that specific binding site is. This is now being driven by something about the local environment essentially.
Alex Holehouse (00:45:48):
So the reason I’m showing this is just to make the point that actually if you can imagine a simple kind of question that you could ask with polymers, you can do it in PIMMS. And so we’ve had a lot of fun building different types of systems and asking about different types of architectures where this is amenable to asking these simple kind of thought experiment type questions more than necessarily trying to model any specific thing. Although you could of course use this to model specific biological systems if you had sufficient parameters and believe that the intrinsically low resolution that we’re operating at, if you think that’s appropriate.
Alex Holehouse (00:46:21):
The very last bit I want to talk about in the context of PIMMS is we’ve been working, and this is really work that Ryan and Garrett in my lab have been pushing, towards a protein-like forcefield for PIMMS. And I think one of the challenges with these sorts of very coarse-grained models is I’m more comfortable saying protein-like than protein because there’s a lot of things we’re ignoring. We’re assuming every amino acid is the same size, we’re assuming there’s no directionality on the side chain. But at the same time we can encode a reasonably diverse array of physical chemistry both in terms of the bead-bead and the bead-solvent interactions as well. And we have these multiple length scales at which interactions can happen.
Alex Holehouse (00:46:56):
So there are two components to this forcefield. One of them is the backbone flexibility. And so we parameterize the backbone flexibility of this forcefield by running all-atom simulations where we turn off all of the interactions except the steric overlap. So this gives us the intrinsic stiffness of the chain as a function of sequence. And then we have these non-bonded interactions, and we built the non-bonded interactions by starting somewhere that made sense. So we actually started using this really nice set of parameters developed by Jerelle Joseph and Alex Reinhardt in this paper that came out from Rosana Collepardo-Guevara’s group late last year, in November last year. And so several people in my lab got really excited by this paper and started building, working with this Mpipi model. And so we used this as a starting point and then slowly refined the parameters to take into account the fact that our simulations are being done on a lattice. We do explicitly consider the solvent and we’re having to discretize the interactions across the length scale that the lattice affords.
Alex Holehouse (00:47:52):
So just to give a summary, how well does this actually work? The answer it turns out is surprisingly well. I say surprisingly well because I wouldn’t have expected it to have been necessarily able to recapitulate behaviors this well. So what we’re showing here is 101 different proteins where we have experimentally measured small angle x-ray scattering data for disordered proteins. The X axis here is the experimental SAXS data. The Y axis here is the simulated radius of gyration. And what you can see is we get a reasonably good agreement. There are definitely some outliers, there’s some big outliers up here. These actually typically are very highly charged proteins. And we think that might be something to do with the fact that either we are overestimating certain contributions to the electrostatics or there may be local neutralization of charged residues under some context, not totally sure.
Alex Holehouse (00:48:40):
But by and large, we’re getting things right in the right way and our Pearson correlation coefficient is reasonably good. Our root mean squared error is comparable. If we compare this back against Mpipi, which again I think is a really, really fantastic model, we’re in sort of the same ballpark. We’re not quite as good, but we’re in the same ballpark. And if we take this larger set of proteins down and do it with the subset that we happen to have that overlaps with these 17 proteins we get something now that is comparable in terms of the Pearson correlation coefficient and the error as well. Like I say, we work with this Mpipi model a lot as well. It gives us lots of things that this doesn’t give. So I would never propose or suggest that this is better than this Mpipi model at all. It’s not. But the nice thing is that these individual simulations can take on the order of tens of seconds. So the throughput that you can get for these types of things is nice and can be done locally in a way that’s very easy to do.
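The two summary statistics mentioned here are straightforward to compute once experimental and simulated radii of gyration are paired up. The sketch below is a plain-Python version of the standard definitions; the Rg values are invented for illustration and are not data from the talk.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def rmse(x, y):
    """Root mean squared error between two equal-length lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Invented radii of gyration in angstroms, for illustration only.
experimental_rg = [18.0, 25.5, 31.0, 40.2, 22.4]
simulated_rg = [19.1, 24.0, 33.5, 38.0, 23.0]

r = pearson_r(experimental_rg, simulated_rg)
error = rmse(experimental_rg, simulated_rg)
```

The two numbers answer different questions: the correlation says whether the model ranks proteins correctly from compact to expanded, while the RMSE says how far off the absolute sizes are.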
Alex Holehouse (00:49:38):
An important question, having worked in this space for a while, is if we’re doing things and we’re getting the right RG, there’s actually many ways you can get the same RG. There’s many possible ensembles that might give you the same radius of gyration. So are we getting these numbers for the right reasons? In such a large data set it’s unlikely we’re getting them completely wrong, but it’s worth asking this question. We can do this. We actually take the ensembles that we get out of PIMMS and we can back calculate scattering profiles, so the actual full small angle x-ray scattering data.
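One simple way to see how a scattering profile carries more information than a single radius of gyration is the Debye formula, which sums sin(qr)/(qr) over all pairs of scatterers. The sketch below is a bare-bones illustration for a single conformation, assuming identical point scatterers and no solvation shell; it is not the back-calculation method used in the talk, and the coordinates are invented.

```python
import math


def radius_of_gyration(coords):
    """Root mean squared distance of the beads from their centroid."""
    n = len(coords)
    centroid = [sum(c[i] for c in coords) / n for i in range(3)]
    return math.sqrt(sum(math.dist(c, centroid) ** 2 for c in coords) / n)


def debye_intensity(coords, q):
    """Normalized scattering intensity I(q)/I(0) from the Debye formula.

    Assumes identical point scatterers, so every pair contributes
    sin(q*r)/(q*r), with the limit 1 when q*r is zero.
    """
    n = len(coords)
    total = 0.0
    for a in coords:
        for b in coords:
            x = q * math.dist(a, b)
            total += 1.0 if x == 0 else math.sin(x) / x
    return total / n ** 2


# Invented four-bead conformation, coordinates in arbitrary length units.
conformation = [(0, 0, 0), (1, 0, 0), (2, 1, 0), (3, 1, 1)]
rg = radius_of_gyration(conformation)
profile = [(q, debye_intensity(conformation, q)) for q in (0.0, 0.1, 0.5, 1.0)]
```

At small q the profile depends only on the radius of gyration, following I(q)/I(0) ≈ exp(-q²Rg²/3), while at larger q it depends on the full set of pair distances. That is why two ensembles with the same Rg can still be told apart by their full profiles, which in practice are averaged over many conformations.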
Alex Holehouse (00:50:05):
And if we do this, we see again this was not something I would have necessarily expected. We have to make a couple of assumptions by setting the solvation shells a little bit bigger than you’d expect, but perhaps that’s not so surprising because we don’t have side chains. But if we have a fixed slightly larger solvation shell, we get something where our computed scattering profile, which is in black, versus our experimental … No, that’s backwards. Our computed scattering profile should be in red and our experimental scattering profile should be in black. We see pretty good agreement.
Alex Holehouse (00:50:34):
So this region up here is where you learn about the radius of gyration. This is the Guinier region. But even as we go across the different wave numbers, we’re seeing pretty good agreement across here. This is for Ash1, which is a protein that Erik and I worked on in 2016-2017. This is for part of the RNA polymerase 2 CTD domain. Again, Scott Showalter collected really nice clean scattering data on this. We see good agreement. This is for the hnRNPA1 protein that Erik and I worked on. Again, these are chosen in part because I had the data handy, but we have data for a larger subset.
Alex Holehouse (00:51:06):
This is scattering data for a set of different A1 variants from Anne Bremer, Mina Farag, and Wayde Borcherds’ paper that came out earlier this year. Again, we’re doing a pretty good job across the full regime of the scattering profile, capturing the behavior that we’d expect to see. And this to me was really reassuring because it was very unclear to me if a super simplified on-lattice model would really be getting us the right ensemble shapes. And by all the evidence we have at the moment, by and large they seem to be. Of course there are examples where these things are not in such good agreement and we get the RG wrong, but where we get the RG right we tend to be pretty much bang on the money in terms of getting the scattering profiles.
Alex Holehouse (00:51:46):
So with all that in mind, what about phase separation? That’s sort of a big focus. That’s why we’re here one might say. These are simulations I just ran this morning because I realized I didn’t have any nice ones here. These are A1-like chains. So this is just a sub region of hnRNPA1. It’s about a 56 residue region. I’m using a slightly smaller thing because I wanted to make sure I had time to run it before this talk. So we have a hundred copies of this chain in a box. We get this nice droplet.
Alex Holehouse (00:52:14):
We can do exactly the same thing for an equivalently sized region of DDX4. And even though these sequences are pretty similar, you can see there are actually, just visually, differences between what these droplets look like. The interfaces here are much fuzzier. This thing’s actually much less stable. It will fall apart if we turn the temperature up a little bit. And so we’re able to build these sort of condensate-like simulations where we could in principle build phase diagrams on the order of hours on a single CPU on a desktop computer.
Alex Holehouse (00:52:41):
And so this I think affords a level of throughput in the context of a variety of questions that we are now exploring quite heavily. And this protein flavored forcefield really is a relatively recent development. We’ve been working on it over the summer basically. And we now have a version that we think is probably doing the right things for the right reasons, which is reassuring and good.
Alex Holehouse (00:53:02):
I’m going to qualitatively compare some real data. These are simulations run for the full-length hnRNPA1 where we varied the number of aromatic residues to match the number of aromatic residues that Erik and Ivan and I looked at in our work a couple years ago. So the dots here are from the simulations. The black curve is fitting this to Flory-Huggins theory. And as we increase the number of aromatic residues, we increase the critical temperature. This temperature is not real temperature, but for the sake of this discussion, that’s fine. And as we decrease the number of aromatic residues, we decrease the critical temperature as well.
Alex Holehouse (00:53:35):
And so in both cases we’re seeing this sort of approximately equal effect of adding or removing aromatic residues. This matches up quite nicely with what we’d seen experimentally as well. So on the right here I’m showing experimental data. This is on a log scale on the X axis. This is not on log scale, that’s why the shapes are different. If I were to plot this on a log scale it would look much more alike. But the key thing to know is the spacing in these critical points is uniform, and we see this kind of diminution or enhancement in the driving force for assembly as a function of the number of aromatic residues. And that function, if we calculate an apparent Flory chi parameter for these things, we see very good agreement between the PIMMS simulations done here and then the simpler simulations and experiments done here as well.
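To make the link between a Flory chi parameter and a critical temperature concrete, here is a minimal sketch using the textbook Flory-Huggins free energy for a polymer of length N in solvent. It assumes a purely enthalpic chi = epsilon / T, with temperature in arbitrary units, and the epsilon values are hypothetical. It is not the fitting procedure used for the curves in the talk.

```python
import math


def fh_curvature(phi, n, chi):
    """Second derivative of the Flory-Huggins free energy per site (in kT).

    f(phi) = (phi / n) ln(phi) + (1 - phi) ln(1 - phi) + chi * phi * (1 - phi)
    Negative curvature marks the unstable (spinodal) region.
    """
    return 1.0 / (n * phi) + 1.0 / (1.0 - phi) - 2.0 * chi


def fh_critical_point(n):
    """Critical volume fraction and critical chi for chain length n."""
    sqrt_n = math.sqrt(n)
    phi_c = 1.0 / (1.0 + sqrt_n)
    chi_c = 0.5 * (1.0 + 1.0 / sqrt_n) ** 2
    return phi_c, chi_c


def critical_temperature(epsilon, n):
    """Critical temperature under the assumption chi(T) = epsilon / T."""
    _, chi_c = fh_critical_point(n)
    return epsilon / chi_c


# Hypothetical interaction strengths: a stickier sequence has a larger epsilon.
chain_length = 100
tc_fewer_stickers = critical_temperature(epsilon=0.55, n=chain_length)
tc_more_stickers = critical_temperature(epsilon=0.65, n=chain_length)
```

Under these assumptions the critical temperature scales linearly with epsilon, so equal steps in interaction strength give evenly spaced critical points.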
Alex Holehouse (00:54:19):
So this gives us some confidence that we are able to capture at least qualitatively, if not even quantitatively, the effect of sequence-specific effects on both single chain and phase behavior. Maybe not as accurately as some of the models that have been released in the last year or so. So both Rosana’s group and Kresten Lindorff-Larsen’s group have put out really nice looking models and there’s a couple of others as well. I think the big advantage we have here is just the ease of use and the performance. So you can do this on your laptop at home as you drink coffee. And so that’s what I have done this morning in the context of preparing this talk.
Alex Holehouse (00:54:59):
One final note that I made myself put in because I wanted to just make sure I remembered to say it, which is, in the context of PIMMS, we are throwing out a lot of things that might be important for IDRs. So there’s no secondary structure. In principle you could build that in, and perhaps we will, but we haven’t done so yet. And so really we’re treating these as protein flavored flexible polymers as opposed to proteins. And it is important to remember that not all IDRs are going to phase separate under physiologically relevant or even physiologically close conditions. And there’s a whole variety of chemistry here that’s going to determine this.
Alex Holehouse (00:55:32):
As a field, the disordered protein field largely worked from proteins that didn’t self assemble for a decade because it was really difficult if they assembled. So if you look at the disordered protein biophysics work done up until about 2013, 2014, it’s almost exclusively done on really soluble proteins. And so I say this just because I think, again, this audience probably is well aware of this, but I think it bears repeating: this idea that IDRs and phase separation are interchangeable is not necessarily the case at all. It is true that many of the proteins that phase separate have IDRs, but it is not true that IDRs means things will phase separate. And Erik and I wrote a review paper about this in a journal that apparently no institutions have subscriptions for. And so there is a tiny URL here that says not all IDRs. And if you go to that you will be able to download a PDF of the paper, or feel free to email me. I get a request about once a week since this is published. So it’s fine. I’m used to it at this point.
Alex Holehouse (00:56:26):
Very quickly as we finish, PIMMS, like our other tools, is meant to be incredibly easy to install. I had PIMMS written here then I realized I’m not 100% sure PIMMS is actually available as a name, so I’ll have to double check that. But once we release it, it’ll be a single install here. You won’t need to do anything fancy, you won’t need to compile any other software. It will just work out the box. Linux, Mac, and Windows.
Alex Holehouse (00:56:46):
Running simulations is simply as easy as writing pimms -k and providing a key file. The key files are built and written to be read by humans. So what I’m showing you here is the key file, which just defines how the system gets set up. So for example, we can say the box is going to be 50 by 50 by 50. We have a parameter file which defines how the beads in the system interact. I’ll talk about that in just a second. Say how many chains we want. So we have a hundred copies of this chain here. We can talk about how long we run the simulation for. We can define things about the analysis around how often we write snapshots out to visualize and work with. You can control things about the move set and there’s a bunch of other options that you can add as well. But in principle, you really need to provide a very small number of options to begin running simulations from your computer.
Alex Holehouse (00:57:32):
The only other file you need other than this key file is a parameter file. The parameter file just defines how the beads in the system interact with one another. And so that parameter file then again is a very simple text file where you define the names of the beads and the interaction between them. So here we have a system with two beads, A and B. We define the AA interaction to be zero, AB as negative five, so that’s attractive. BB is negative three, that’s less attractive, but still attractive. And then we define interaction with the solvent. So A zero the solvent is nothing, and B zero at solvent is plus three, which means that this B residue is actually hydrophobic. This is meant to be easy to read and write. And so the goal here is to have something that is easy to work with so that you can go from idea to testing hypotheses very quickly.
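The two-bead example above can be written out as a small pairwise interaction table. This is a minimal sketch and not PIMMS's parameter file format or its energy function: the `"0"` label for solvent, the dictionary layout, the zero solvent-solvent term, and the nearest-neighbour-only sum on a cubic lattice are all assumptions made for illustration.

```python
SOLVENT = "0"  # hypothetical label for an empty (solvent) lattice site

# Pairwise energies from the example: negative is attractive.
PAIR_ENERGY = {
    frozenset(("A", "A")): 0.0,
    frozenset(("A", "B")): -5.0,
    frozenset(("B", "B")): -3.0,
    frozenset(("A", SOLVENT)): 0.0,
    frozenset(("B", SOLVENT)): 3.0,   # unfavourable: B behaves as hydrophobic
    frozenset((SOLVENT, SOLVENT)): 0.0,  # assumed, not stated in the talk
}

# The six nearest neighbours of a site on a simple cubic lattice.
NEIGHBOR_OFFSETS = [
    (1, 0, 0), (-1, 0, 0),
    (0, 1, 0), (0, -1, 0),
    (0, 0, 1), (0, 0, -1),
]


def lattice_energy(beads):
    """Total contact energy for beads given as {(x, y, z): bead_type}.

    Every unoccupied neighbouring site counts as solvent.
    """
    total = 0.0
    for (x, y, z), kind in beads.items():
        for dx, dy, dz in NEIGHBOR_OFFSETS:
            other = beads.get((x + dx, y + dy, z + dz), SOLVENT)
            energy = PAIR_ENERGY[frozenset((kind, other))]
            # Bead-bead contacts are visited from both beads, so halve them.
            total += energy if other == SOLVENT else 0.5 * energy
    return total


two_b_apart = lattice_energy({(0, 0, 0): "B", (5, 0, 0): "B"})
two_b_touching = lattice_energy({(0, 0, 0): "B", (1, 0, 0): "B"})
a_b_touching = lattice_energy({(0, 0, 0): "A", (1, 0, 0): "B"})
```

In this toy model, bringing two B beads into contact lowers the energy both through the B-B attraction and by removing two unfavourable B-solvent contacts.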
Alex Holehouse (00:58:16):
Our goal is to have this up and installable by the end of 2022. I put this disclaimer in here because I realized this talk is being recorded so it’d be very easy to prove me wrong if this does not happen. So that is what we’re aiming for. We would actually like to have it done sooner, but I’m giving myself a bit more of a window. But that is the goal. And I think broadly we have found it to be really useful both for asking biophysical or biological questions, but also just for honing our intuition as to how things may or may not work.
Alex Holehouse (00:58:47):
So with that, I’m just going to finish up and conclude. I realize we’re running a little bit over. I’ve talked a bit about some of the sequence-based tools and I didn’t talk about some of the other tools we have. We have tools for rationally designing disordered proteins. We have tools for predicting things. This is not quite out yet, but we’re anticipating early 2023 this will be published. This ability to play. And we built these tools to be such that they can work by themselves and that should be easy and that should be fine. But the idea of course is that we’d like to build this kind of loosely coupled ecosystem where these different tools work together. So within SHEPHARD you can predict disorder, then you can design sequences or you can train models based on things that you’ve learned or calculated inside SHEPHARD. And so again, creating required dependencies is not good, but enabling this loosely coupled collection of software I think is a useful way to build a tool base that we and hopefully others will find useful.
Alex Holehouse (00:59:41):
I will then quickly say the simulation based tools I think offer a complementary approach for thinking about IDR behavior both by themselves and in the context of interacting with other things. And I haven’t discussed this at all, obviously, but we have this analysis tool for working with all-atom or coarse-grained ensembles of disordered proteins. This is SOURSOP, this has been out for a while and there’ll be a preprint of this hopefully in the next couple of weeks. We’re just finalizing it. Finishing it up. And this has been led by Jared when he was a graduate student in Rohit’s lab. He’s now the senior technology scientist in the lab and really does an incredible job in that context. And so this has been an ongoing collaborative work with Rohit’s lab. This is here now. If you are a simulation person that looks at IDPs, take a look. We have a whole bunch of nice features in there.
Alex Holehouse (01:00:29):
With that, I’ll just say thank you to the people in my group. Almost everything I spoke about today was done by one or more of Dan, Garrett, J, and Ryan. Thanks to our funding sources. Actually, Dewpoint funded part of the work, especially the SHEPHARD project, and has been really generous with their time and their money. And I think having companies involved in this space that are willing to help and be involved–they’ve sponsored poster prizes for IDP seminars, activities–makes a really big difference to the community at large. And so with that, I’m going to stop and I’d be happy to answer any questions.
Erik Martin (01:01:05):
There seem to be a number of questions building in the chat and so I’ll call people and have you unmute and ask your question. But first I’m going to take the opportunity to ask a question myself because that’s what I get to do because I have the microphone. This might be a bit esoteric to the IDP simulation field, but I couldn’t help but notice that your PIMMS force field got the proteins wrong in a more expanded sense where the field has traditionally gotten it wrong the other way around. But you’re doing pretty well with the compact sequences and I was wondering if you have a good explanation for that?
Alex Holehouse (01:01:46):
Yeah, so there’s a few things that could involve a very, very, very long discussion, which I don’t think anyone except you and I would be interested in. I think the bottom line is that probably we get the things too expanded. Because the things that we definitely get that are too expanded are really charged. And maybe that’s because our ability to correctly get long-range electrostatics done in an appropriate way is slightly off in terms of how we’re treating it. That’s certainly possible and something we’re looking at refining.
Alex Holehouse (01:02:15):
I think also something that I am increasingly, I guess, concerned slash excited about is work from Rohit’s group, making the point that you have this charge-state heterogeneity. And so if you have highly charged sequences, the local pKas can be shifted by the fact you have this charge around it. And so you can get neutralization, protonation or deprotonation of charged residues within pH regimes that you would never naively expect based on the model compound pKas. So for both prothymosin alpha and one of the histone H1 tails, which is a whole bunch of lysines, those for us are way too expanded. And I have a sneaking suspicion that perhaps this is because of local pKa effects, which may be actually engendering some degree of local compaction, but I’m not totally sure.
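To illustrate how a shifted local pKa changes the charge on a lysine-rich tail, here is a minimal sketch based on the Henderson-Hasselbalch relation. The model-compound pKa of about 10.5 for the lysine side chain is a commonly quoted textbook value. The two-unit downward shift and the tail length are hypothetical numbers chosen only to show the direction of the effect, and they do not come from the work discussed.

```python
def fraction_protonated(ph, pka):
    """Fraction of a titratable group that is protonated at a given pH.

    For a basic side chain such as lysine, the protonated form carries +1.
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def mean_charge_of_lysines(n_lys, ph, pka):
    """Mean net charge from n_lys independent lysines sharing one pKa."""
    return n_lys * fraction_protonated(ph, pka)


PH = 7.4
MODEL_PKA = 10.5    # approximate model-compound value
SHIFTED_PKA = 8.5   # hypothetical local pKa in a highly charged tail
N_LYS = 40          # hypothetical number of lysines

charge_model = mean_charge_of_lysines(N_LYS, PH, MODEL_PKA)
charge_shifted = mean_charge_of_lysines(N_LYS, PH, SHIFTED_PKA)
```

A fixed-charge model that assigns every lysine +1 corresponds to the first case, so any downward pKa shift would mean less charge, and plausibly less electrostatic expansion, than such a model assumes.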
Alex Holehouse (01:03:02):
I think one thing that I have been thinking about is for sequences that have aliphatic hydrophobic residues, how should we think about aliphaticity in the context of these kinds of forcefields? Especially where we have implicit solvent. We are imposing an implicit solvent model on here. In general as a lab, well I think all of the folks that we operate with use implicit solvent. And thinking about the driving force for the hydrophobic effect when you don’t have an explicit solvent, I think is something that is more complicated than perhaps I had thought about previously. So we’re thinking about ways to get around some of those challenges as well.
Erik Martin (01:03:45):
Cool. Hopefully that was interesting to more than just the two of us, but let’s see. Next I’d like to ask Avi to unmute and ask a question. And this is not just because I kept not calling on him last time I was the host.
Avinash Patel (01:04:02):
No hard feelings, Erik. Great, Alex. That was a wonderful talk and nice to see you again.
Alex Holehouse (01:04:06):
Good to see you, Avi.
Avinash Patel (01:04:09):
That was a wonderful simulation of the transcription factors and the accessory factors and all that. But there is another notion that these factors or the condensates can sequester transcription factors away from the DNA binding sites. Do you observe that as well or have you ever seen that, or is it always almost exclusively in your current models happens at the DNA binding sites?
Alex Holehouse (01:04:34):
No, you could get either. So I’ll just go back to the slide. So depending on your perspective, this is either interesting. And I think it’s interesting because it provides us with a way to think about and visualize our problem. I think there’s a lot of value in being able to see things and talk about them. But at the end of the day also we are getting … This is a model that I … It’s not parameterized like, can we change the parameters to behave in a certain way?
Alex Holehouse (01:05:03):
So we could totally make the accessory factor not bind to this site here. And it would do exactly as you’re saying, it would now form somewhere that wasn’t on the chromatin and would sequester this transcription factor away. And I think all of the above are possible. I think the bigger question that we are really interested in is, how does this specificity, either for binding under transcription or broadly in condensates? How is this modulated by the complement of things that are there?
Alex Holehouse (01:05:34):
So I think oftentimes at least I have thought about a protein saying, yes, it gets into a P body or the nucleolus or it doesn’t. And I think more and more we’re starting to think that maybe that binary sort of yes or no is not the right way to think about it. And rather it depends on what else is there. And I think from a control perspective this is really appealing because it means recruitment or exclusion will depend on the cellular state essentially. And so now you have this way to build, and that will then release things and then other things will go to other places.
Alex Holehouse (01:06:02):
So you have this intrinsically interconnected network of specific recruitment or exclusion where everything can depend, whether it does, can depend on other things as well, which from a building a cellular surface perspective I think opens up some doors that would previously have been really difficult to rationalize. So the bottom line being I think in our simulations all these things can happen. I think in the cell probably these things do happen and the whole gamut of possible things can happen as well. And I think our challenge putting our biologist’s hat on, our challenge is to say, well this is an example of a thing that can happen. Do we have ways to predict, interpret, or understand when it does happen and make testable predictions in that context of real systems where we can have readouts that are biologically meaningful or clinically significant?
Avinash Patel (01:06:49):
Perfect. Thank you. Got it.
Erik Martin (01:06:55):
Okay. I think next I’m going to ask Bede to unmute because I believe he has a follow-up question to that. Oh, you’re here.
Bede Portz (01:07:07):
Hi, Alex. Have you done things like altered the spacing of the binding sites and/or added a boatload of what you’re calling the accessory factor? And at what sort of threshold do you start to occupy additional sites rather than grow the size of the existing site?
Alex Holehouse (01:07:32):
These are really good questions. So because this is not a talk about this specific question or problem I’ve glossed over a ton of details. One thing we see is, depending on the strength of the accessory site for itself, you can have a scenario where you are, and this comes back to work from Stephan Grill and other people. That you are really below the in isolation saturation concentration. But the binding sites function as a sink, like a nucleation source. So you get basically a pre-wetting transition at that site.
Alex Holehouse (01:08:06):
So you can have a scenario where there’s a finite size for this droplet, basically. It won’t grow beyond a certain size because if it gets sufficiently big–the way to think about this is the beads, I’m going to call them beads here–the beads on the outside of this droplet don’t actually know there’s a binding site inside. You basically go beyond the correlation length of the droplet. And once you pass that correlation length, then things can’t… like entropy wins and you don’t get that anymore.
Alex Holehouse (01:08:29):
So depending on the strength of the accessory factor for itself, you can control then which of the binding sites nearby are actually going to be accessible. So if this is really strong, you will get your first scenario. You keep adding more in, it’ll grow, grow, grow, grow, grow. And eventually the whole thing is engulfed in one giant condensate and then you kind of lose specificity again.
Alex Holehouse (01:08:50):
On the other hand, if these are sufficiently weak that you’re doing sort of pre-wetting transition here, then you don’t ever grow beyond a certain size, and regardless of how much of this accessory factor you add in–I guess until probably something else happens at a certain point; maybe you start to recruit the transcription factor separately–but you don’t see this growth. Basically, as long as you’re under the intrinsic saturation concentration of the accessory factor, you just see local interactions.
Alex Holehouse (01:09:19):
I think this then combined with your first question of changing the spacing of these binding sites gives you a whole bunch of axes to play with in terms of thinking, well what things are going to determine which sites get bound or don’t get bound as the case may be. And this is something we’re actively working on. Because I think on the one hand this might not be easy to interpret, but on the other hand, as you well know, there’s a lot of very high resolution published data looking at which transcription factors bind where and under what conditions. And so our hope is that having some physical model of what we think might be going on–we maybe lack a little bit of the three-dimensional information into chromosomal information, at least in yeast–but take what we think is going on based on existing data plus this physical model and figure out if there’s an interpretable code for how some of these IDRs and transcription factors determine which sites get bound or don’t get bound as the case may be.
Erik Martin (01:10:15):
Great. I see a few questions about PIMMS and nucleic acids. So I’m going to ask Tamas Lazar to unmute and ask his question.
Tamas Lazar (01:10:30):
Hi, Alex. My question is, if I want to simulate proteins with nucleic acids, can I do that in PIMMS? And will I find a parameter file that enables me to parameterize my nucleic acids, let’s say RNA, if I want to test different RNA sequences with a protein of interest?
Alex Holehouse (01:10:52):
So there’s a few layers of challenges to this question. So the first layer of challenge is that, in general, and I don’t know, there may be people in the audience who work on this, and if they do I apologize. But in general, the quality and the, and I call this wrongly, the fidelity of forcefields, regardless of your resolution. So just from all-atom up to coarse-grained for nucleic acids is not fantastic. And part of this is the fact that we just don’t have a ton of data. I’m taking advantage here in terms of our comparison to experiment. We have these 101 proteins, we have many others that we’re not comparing against here. Where’s my little thing? There we go.
Alex Holehouse (01:11:29):
So people have measured x-ray scattering for many different sorts of proteins. This is definitely not all of them. We have others. For nucleic acids, we don’t have analogous data sets that are on this scale. And that’s the one problem. The second problem is that for nucleic acids you just have way more degrees of freedom. There are more dihedral angles in a nucleic acid per monomer than there are in polypeptides by a long way. And so how that then influences chain flexibility and what that does to these very simple spherically symmetric models is I think a complicated question to answer. But I briefly considered it years ago and decided it’s too hard. So it is not too hard for anyone, but it is too hard for me.
Alex Holehouse (01:12:13):
And then I think the final challenge is that there’s a whole bunch of different types of interactions in the context of nucleic acids. So you have obviously Watson-Crick base pairing, but you have also stacking, you have the really strong electrostatics of the phosphate backbone. These are all types of things that conventional fixed charge forcefields are not particularly good at anyway.
Alex Holehouse (01:12:34):
Now, does this mean that we couldn’t parameterize at least a simple coarse-grained model that maybe wouldn’t have sequence specificity or might have the flavor of RNA? That I think we probably could do. And we’ve been thinking about doing that. There’s a nice version of that in Jerelle’s Mpipi model that we’ve found, maybe again to my surprise, maybe my intuition’s just really wrong, but we found to work quite well. We have a paper that I hope will be somewhere soon looking at that.
Alex Holehouse (01:13:02):
So I think if you want to say I have my protein condensate and I add in poly U, is it going to go? Yes or no? That kind of question I think we’ll be able to answer. If the question is, I have a specific 10-mer and I want to ask if changing the sequence or adding a loop or something is going to change specificity, then I don’t think we’re going to be able to answer that now. And I would be surprised, given the resolution that PIMMS affords, if we would ever be able to answer a question of that kind of class essentially. So it’s a good question and it’s something we have thought about a lot. We work on protein and nucleic acid interactions and I think it’s tricky for a variety of reasons.
Tamas Lazar (01:13:43):
Poly U versus poly A versus poly G.
Alex Holehouse (01:13:44):
That, maybe. I think you could imagine saying, well we’re going to make poly G really sticky for itself for forming G-quadruplexes and then we’ll make A, C, and U less sticky for themselves. I think there are probably degrees of coarse-graining you could do in that way. I don’t know, I’m less familiar with the available experimental data. Lois Pollack at Cornell has done a bunch of really nice small angle x-ray scattering data on some short single stranded nucleic acids. But in general, if we wanted to calibrate to ask how well are we doing at a resolution that we should expect to be accessible. So you could go and say, well let’s compare structured RNAs that we get from either cryo-EM or NMR and see if we predict those structures. I don’t think we’re ever going to get those structures right. So now the question is okay, how would you know how well you’re doing? Maybe there’s ways to do this. I don’t know if the data exists yet. I think it could do. That might be a useful thing for the community to think about, but I don’t know. I think best case we would be able to do poly A is different to poly U to poly G to poly C. But I don’t know if we’d be able to do that. I could see even that being tricky.
Tamas Lazar (01:15:00):
Erik Martin (01:15:03):
I have a question that is going to be of a very similar flavor coming from Kamran at Dewpoint.
Kamran Rizzolo (01:15:14):
Hello. Thank you, Alex, for the great talk. Just in regards to the PIMMS tool, have you thought about, or have you already run simulations including any tool compounds or well-characterized compounds that … It sounds like it’s an obvious question so I wonder-
Alex Holehouse (01:15:32):
No, no, no. Well, I think it depends on your perspective. It either is or it isn’t. We haven’t. I am really … So the way that force field development typically works–especially in the context of condensates–is you build some model like we’ve done here and then you calibrate based on either a combination of physical intuition or measurements against single protein experiments. Then you ask how well do you predict condensate behavior?
Alex Holehouse (01:16:00):
But my suspicion is that if one were to have a set of condensates that were chemically different and then you just screen for recruitment or uptake of a library of small molecule compounds, you could flip this question around entirely and you could learn the chemistry because you have the polypeptide chemistry already built into the model. You could learn the chemistry of the small molecules with respect to recruitment or exclusion and you could parameterize a forcefield from condensate exclusion / recruitment.
Alex Holehouse (01:16:29):
We haven’t done this. I’m not a chemical biologist and so I don’t necessarily feel like I have the background to think about what kinds of things would be easy or hard. I think if we somehow fell into some large pot of money, this would be a really interesting question to look at. My problem is that it may be something that very quickly we discover will not work and there are reasons that it might not work. For example, we don’t have a great control. Basically everything in PIMMS has to be … The size of every basic unit is one bead or polymers of beads. So we can’t change the bead sizes.
Alex Holehouse (01:17:07):
So this might raise an intrinsic issue which is, if we have a condensate that’s made of polymers of polypeptides and then we put in a small molecule drug, we might not be able to make the small molecule small enough to weasel its way through if there’s particularly close interactions. And that might be really physically unrealistic and we might lose a whole bunch of things that would be relevant. In principle, maybe you could parameterize that out. But in practice, just from how the code has been built, the sizes are fixed.
Alex Holehouse (01:17:34):
So I think it’s a really interesting question and we haven’t looked at this in detail. And given infinite time and money, I would love to. I think there’s probably a lot of … The big advantage we have here is throughput. So we could screen thousands of two- to three-bead molecules and ask, what is the partition coefficient into one of these condensates? in a couple of days. So that would be really interesting, but there are a whole bunch of reasons why it might not work that I don’t necessarily know if I have a great intuition for or against.
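The partition coefficient mentioned above is the ratio of a molecule's concentration inside the condensate to its concentration outside. Here is a minimal sketch of how it could be estimated from a single simulation snapshot. It assumes a spherical droplet with a known centre and radius in a cubic box and ignores periodic boundaries. All names and numbers are hypothetical and none of this is a PIMMS analysis routine.

```python
import math


def partition_coefficient(positions, center, radius, box_length):
    """Ratio of small-molecule concentration inside vs outside a droplet.

    positions  : list of (x, y, z) coordinates of the small molecules
    center     : (x, y, z) of the droplet centre (assumed known)
    radius     : droplet radius (assumed spherical)
    box_length : edge length of the cubic simulation box
    """
    dense_volume = (4.0 / 3.0) * math.pi * radius ** 3
    dilute_volume = box_length ** 3 - dense_volume

    n_inside = sum(1 for p in positions if math.dist(p, center) <= radius)
    n_outside = len(positions) - n_inside
    if n_outside == 0:
        return math.inf  # nothing left in the dilute phase in this snapshot

    return (n_inside / dense_volume) / (n_outside / dilute_volume)


# Toy snapshot: three molecules inside the droplet, one outside.
snapshot = [(25, 25, 25), (27, 24, 26), (22, 25, 30), (2, 3, 4)]
k_part = partition_coefficient(
    snapshot, center=(25, 25, 25), radius=10.0, box_length=50.0
)
```

In practice the estimate would be averaged over many snapshots, since counts from a single frame are noisy.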
Erik Martin (01:18:09):
Cool, thank you. Next I’d like to ask Alan Underhill to ask another question about the transcription factor systems.
Alan Underhill (01:18:19):
It’s just building on that PIMMS situation. And thanks again, that was a great talk and thanks so much for the tools you create and making them accessible to the community. So is this scalable to a transcription factor that may actually have very complex binding site architecture? So it may use up to three sub domains to recognize target sites with different affinities. So really trying to look at high resolution ChIP-Seq data, look at the proportion of binding sites across the genome and kind of reverse engineer what the conformational ensemble of the protein would actually look like?
Alex Holehouse (01:18:54):
Yeah. So these simulations are from a grant that the NIH decided not to fund. If they fund it this time, this is I think one of the questions I’d be really interested in playing with. I’m not a chromatin biology person. I was up at Cornell a few weeks ago, I was talking to Frank Pugh about this as well. I think there are a bunch of really potentially interesting questions to ask here, and I think at some level we have some tooling that would be really effective for asking those questions.
Alex Holehouse (01:19:27):
I sort of feel like it’s something where we probably need to work alongside someone who understood the biology better. Because I certainly don’t, and as a lab we are stretched pretty thin, as I suspect that the people in my group who are on this call would agree to. So I think yes. To answer your question, I think there are reasons why it might be difficult. I think in principle, you could create a system where you just explicitly have some conversion of lattice spacing to 3D dimension and you model all of the chromosomes and you model everything and you say, all right, what is the minimal set of computational components we need to recapitulate some subset, say, of observed ChIP-Seq data, for example?
Alex Holehouse (01:20:15):
I think that would be interesting, but I think that would be difficult. This here is easy because here I’m asking a toy question, which is, can we get specificity for a site based on some other protein being or not being there? I think as soon as we’re going to say, all right, can we actually recapitulate and describe real biological systems? We’re forced to confront some of the simplifications that we make in this model, which may or may not be easy to confront and there’s a bunch of complexities in there. But in principle, yes. In principle, yes.
Alan Underhill (01:20:47):
That’s great. Thanks very much.
Erik Martin (01:20:53):
Okay, I think we have one final question about technical aspects of PIMMS. And I’m going to probably pronounce this name very wrong, but Xiangze Zeng, if you’re still here.
Xiangze Zeng (01:21:07):
Hi, yeah. Hi, Alex. Very wonderful talk. I think I have two quick questions. I think somehow you already answered my first question. You mentioned that for all the beads the bead size are uniform and they have one bead. Could they have different lengths between two additional beads?
Alex Holehouse (01:21:32):
Yeah, it’s a really good question. So to reiterate, the answer to your first question is because of how things are built, we do enforce the size of the beads to be a single size. You can sort of game the system a bit. A big chunk of how we’ve been playing with this over the last year is this, okay, it’s almost like coding in Assembly or very basic C. Which is you have a very finite number of building blocks or things you can do. How can you build complexity out of that small set of building blocks?
Alex Holehouse (01:22:03):
So one of the things you can do to play with the size is you could say, well the intrinsic bead size is this. If you make this such that the solvation is so favorable it can never be desolvated, you can, in principle, basically make the bead a three-by-three cube. You can do that. So that’s one way to make bigger beads because essentially the energetic penalty of anything being within one unit on the lattice would be so great that it never happens. So you can do that, but that’s basically as far as you can go and we don’t have … I guess you could do that and then you could say, okay, for every other bead you have this being really repulsive and this being really repulsive. So there are some tricks you could do, but it would get tricky. Yeah, it wouldn’t scale particularly well just from numerical precision issues more than anything else.
Alex Holehouse (01:22:47):
In terms of differences in lengths between beads, there’s a hacked version that I played with a year or so ago where we added in distance constraints. So you can say, okay, for bead A and B we’re going to fix them so that they’re always some distance apart. In principle, you could do that. You could say, all right, now I can build polymers with arbitrary inter-bead distances using those constraints between different beads. That’s not in any of the public versions or even any of the private versions; it was a specific thing that I did to solve a specific problem. And it would require a fair bit of re-engineering of the underlying code to put that into the main version. I think we probably will do that. I think having constraints or restraints is really useful. They’re not in there right now, and we’re not planning on adding them for this first release just because it will necessitate some restructuring of the underlying architecture, which is fine, but we’ve been just restructuring a little bit here and there for years. At a certain point you need to say, All right, we’re done. We’re going to put this out and then we’re going to have a second version that puts in extra bells and whistles that we’d like to have.
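One simple way to picture a distance constraint between two beads is as a filter on trial moves: any move that changes the constrained distance is rejected. The sketch below is only an illustration of that idea, not the hacked PIMMS version mentioned above, and all names (`satisfies_constraints`, `try_move`) are hypothetical.

```python
import math


def satisfies_constraints(positions, constraints, tolerance=1e-9):
    """Check every constrained pair.

    positions:   dict mapping bead index -> (x, y, z) lattice site
    constraints: dict mapping (i, j) -> required distance between beads i and j
    """
    return all(
        abs(math.dist(positions[i], positions[j]) - target) <= tolerance
        for (i, j), target in constraints.items()
    )


def try_move(positions, bead_index, new_site, constraints):
    """Return updated positions if the move keeps all constraints, else the original dict."""
    trial = dict(positions)
    trial[bead_index] = new_site
    if satisfies_constraints(trial, constraints):
        return trial
    return positions


# Beads 0 and 1 are held exactly three lattice units apart.
positions = {0: (0, 0, 0), 1: (3, 0, 0)}
constraints = {(0, 1): 3.0}

kept = try_move(positions, 1, (0, 3, 0), constraints)      # distance still 3: accepted
rejected = try_move(positions, 1, (4, 0, 0), constraints)  # distance 4: rejected
print(kept[1], rejected[1])
```

A hard filter like this is easy to write but rejects most single-bead moves, which is one reason a real implementation would likely need moves that displace constrained beads together.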
Xiangze Zeng (01:23:57):
I see. So is it possible to use PIMMS to simulate a folded domain with an IDR?
Alex Holehouse (01:24:05):
I had this incredible research tech, Isham Taneja, who worked in my group last year, and he built some software, a Python script, for taking a continuous-space PDB structure and latticizing it. So putting it onto the lattice, which is not actually … it’s actually a hard geometry problem because the optimal solution for that is not necessarily easy to get. You can do that. I think one of the challenges is that as soon as you have more than one folded domain, the way you can reorient that folded domain basically has to be along the cardinal axes of the lattice. And what this then means is the rotational and translational freedom of that particular folded domain is tiny compared to what it should be. So when we’ve looked at this before, what we found is that two folded domains would just stick together, because the entropic penalty for sticking together is way lower than it should really be.
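The point about restricted rotational freedom can be made concrete. A rigid body on a simple cubic lattice can only be rotated in ways that map lattice sites onto lattice sites, and there are exactly 24 such proper rotations (the rotation group of the cube), compared with a continuum of orientations off-lattice. The sketch below enumerates them as signed permutation matrices with determinant +1; it is a generic geometry illustration with hypothetical function names, not part of PIMMS or of the latticizing script mentioned above.

```python
from itertools import permutations, product


def determinant3(m):
    """Determinant of a 3x3 matrix given as nested tuples."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def lattice_rotations():
    """All proper rotations that map a cubic lattice onto itself."""
    rotations = []
    for perm in permutations(range(3)):
        for signs in product((1, -1), repeat=3):
            # Each row has a single +1 or -1 entry: a signed permutation matrix.
            matrix = tuple(
                tuple(signs[row] if col == perm[row] else 0 for col in range(3))
                for row in range(3)
            )
            # Keep rotations (det = +1) and discard reflections (det = -1).
            if determinant3(matrix) == 1:
                rotations.append(matrix)
    return rotations


def rotate(matrix, site):
    """Apply a rotation matrix to an integer lattice site."""
    return tuple(sum(matrix[r][c] * site[c] for c in range(3)) for r in range(3))


ROTATIONS = lattice_rotations()
print(len(ROTATIONS))  # 24
```

With only 24 orientations available, a folded domain loses far less orientational entropy on binding than it would in continuous space, which is consistent with the over-sticking described above.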
Alex Holehouse (01:25:07):
So I am a little hesitant for putting folded domains in because I think there’s a lot of ways people could do things that would be really problematic and the ability to freeze things is not built into the main version. I think, again, we should maybe revisit this because I think there is some utility here, especially in looking at how IDRs interact with folded domains. Part of the other reason is we’ve been working, I sort of mentioned this, but we’ve been working a lot with Jerel’s Mpipi model and it’s great. And so really for our IDR-folded-domain interaction, that’s where we’ve been focusing our time and effort and it’s worked really well. And so for us, we don’t have any huge need because for our biochemical, biological, biophysical needs, they are met by that already. But we can and maybe should do that at some point, putting folded domains in a more user-friendly way.
Xiangze Zeng (01:25:55):
I see. One more question. So I know that PIMMS is a Monte Carlo simulation, how reasonable is it to directly correlate the movements in PIMMS with the dynamics of [inaudible 01:26:11]
Alex Holehouse (01:26:18):
Probably not very. So I would not recommend people do that necessarily. For systems that are sufficiently similar in certain ways, and this is a longer more complex conversation, you can use a move set that, for example, doesn’t allow any large scale translational movement, so only allows local reptation. And you can then compare one versus the other, as long as the number of chains doesn’t change and the number of beads doesn’t change, you can, in principle, compare things and get some qualitative insight.
Alex Holehouse (01:26:47):
Even there though it’s kind of hand wavy. We have done this, I should say, so I shouldn’t be too critical of this as an approach. But in general, I think it’s really easy to trick yourself with trying to convert MC to dynamics. We have thought about putting in specific schemes for doing kinetic Monte Carlo type things. So where you weight movements according to the mass to give appropriate drag coefficients for diffusion, things like that.
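As a rough picture of what weighting movements by mass could look like, the sketch below picks which chain to attempt to translate with probability inversely proportional to its mass, so that heavier chains are moved less often and diffuse more slowly. The 1/mass scaling is an assumption made purely for illustration (it is not a scheme taken from PIMMS), the function names are hypothetical, and a real kinetic Monte Carlo scheme would also need a defensible mapping from move attempts to time.

```python
import random


def move_probabilities(masses):
    """Probability of attempting to move each chain, proportional to 1/mass (assumed scaling)."""
    inverse = [1.0 / m for m in masses]
    total = sum(inverse)
    return [w / total for w in inverse]


def pick_chain(masses, rng):
    """Choose the index of the chain to attempt to move."""
    return rng.choices(range(len(masses)), weights=move_probabilities(masses), k=1)[0]


masses = [10, 20, 40]  # e.g. number of beads per chain
probs = move_probabilities(masses)

rng = random.Random(0)  # fixed seed so the run is reproducible
counts = [0, 0, 0]
for _ in range(10_000):
    counts[pick_chain(masses, rng)] += 1

print(probs)
print(counts)
```

Here the 10-bead chain is selected about twice as often as the 20-bead chain and four times as often as the 40-bead chain.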
Alex Holehouse (01:27:14):
Again, I think this is something that would be useful. The other thing we could do is discrete molecular dynamics on the lattice. There was a whole phase of doing that in the late ’90s when lattice models were all the rage. So I think there are ways to think about doing this in terms of dynamics. We haven’t yet. And again, I think it’s something I’d like to do, but it’s not something that we’ve looked into particularly deeply yet. It’s something I’ve thought about a lot over the years and have never really done very much with. Put it that way.
Xiangze Zeng (01:27:43):
Cool. That’s super helpful. Yeah, thank you. Thank you so much.
Alex Holehouse (01:27:45):
Erik Martin (01:27:49):
So I think now that we’re 34 minutes past the end of the seminar.
Alex Holehouse (01:27:55):
Jill Bouchard (01:27:57):
Don’t be sorry. This was a great discussion. I’m super happy about it.
Erik Martin (01:28:00):
Yeah, you can never complain about good discussion. I think we’ll …
Jill Bouchard (01:28:04):
But we are finally out of questions.
Erik Martin (01:28:06):
Yes. Jill, do you have any final comments?
Jill Bouchard (01:28:09):
Yes, I do. Thank you to everyone who’s still here. Thanks to everyone who came and might have needed to leave a little early, but this was fantastic, Alex. We just really loved seeing everything. I have a feeling this is going to be a very popular video to go back and keep collecting more information. It’s just jam-packed.
Alex Holehouse (01:28:28):
Well, there’s lots of things we can go back and figure out that I was wrong with what I said in my long rambling answers. So that would be entertaining if nothing else.
Jill Bouchard (01:28:34):
I think it’s more for the people that want to hear your talk. But yeah. Thank you. Thank you everyone. I guess we’ll get off and we’ll be back in November and feel free to come back.
Alex Holehouse (01:28:47):
Perfect. Thanks so much for having me.
Jill Bouchard (01:28:50):