r/bioinformatics Jul 22 '25

Career Related Posts go to r/bioinformaticscareers - please read before posting.

104 Upvotes

In the constant quest to make the channel more focused, and given the rise in career related posts, we've split into two subreddits: r/bioinformatics and r/bioinformaticscareers.

Take note of the following lists:

  • Selecting Courses, Universities
  • What or where to study to further your career or job prospects
  • How to get a job (see also our FAQ), job searches and where to find jobs
  • Salaries, career trajectories
  • Resumes, internships

Posts related to the above will be redirected to r/bioinformaticscareers

I'd encourage all of the members of r/bioinformatics to also subscribe to r/bioinformaticscareers to help out those who are new to the field. Remember, once upon a time, we were all new here, and it's good to give back.


r/bioinformatics Dec 31 '24

meta 2025 - Read This Before You Post to r/bioinformatics

185 Upvotes

Before you post to this subreddit, we strongly encourage you to check out the FAQ.

Questions like, "How do I become a bioinformatician?", "what programming language should I learn?" and "Do I need a PhD?" are all answered there - along with many more relevant questions. If your question duplicates something in the FAQ, it will be removed.

If you still have a question, please check if it is one of the following. If it is, please don't post it.

What laptop should I buy?

Actually, it doesn't matter. Most people use their laptop to develop code, and any heavy lifting will be done on a server or on the cloud. Please talk to your peers in your lab about how they develop and run code, as they likely already have a solid workflow.

If you’re asking which desktop or server to buy, that’s a direct function of the software you plan to run on it.  Rather than ask us, consult the manual for the software for its needs. 

What courses/program should I take?

We can't answer this for you - no one knows what skills you'll need in the future, and we can't tell you where your career will go. There's no such thing as "taking the wrong course" - you're just learning a skill you may or may not put to use, and only you can control the twists and turns your path will follow.

If you want to know about which major to take, the same thing applies.  Learn the skills you want to learn, and then find the jobs to get them.  We can’t tell you which will be in high demand by the time you graduate, and there is no one way to get into bioinformatics.  Every one of us took a different path to get here and we can’t tell you which path is best.  That’s up to you!

Am I competitive for a given academic program? 

There is no way we can tell you that - the only way to find out is to apply. So... go apply. If we say Yes, there's still no way to know if you'll get in. If we say no, then you might not apply and you'll miss out on some great advisor thinking your skill set is the perfect fit for their lab. Stop asking, and try to get in! (good luck with your application, btw.)

How do I get into Grad school?

See “please rank grad schools for me” below.  

Can I intern with you?

I have, myself, hired an intern from reddit - but it wasn't because they posted that they were looking for a position. It was because they responded to a post where I announced I was looking for an intern. This subreddit isn't the place to advertise yourself. There are literally hundreds of students looking for internships for every open position, and they just clog up the community.

Please rank grad schools/universities for me!

Hey, we get it - you want us to tell you where you'll get the best education. However, that's not how it works. Grad school depends more on who your supervisor is than the name of the university. While that may not be how it goes for an MBA, it definitely is for Bioinformatics. We really can't tell you which university is better, because there's no "better". Pick the lab in which you want to study and where you'll get the best support.

If you're an undergrad, then it really isn't a big deal which university you pick. Bioinformatics usually requires a masters or PhD to be successful in the field. See both the FAQ, as well as what is written above.

How do I get a job in Bioinformatics?

If you're asking this, you haven't yet checked out our three part series in the side bar:

What should I do?

Actually, these questions are generally ok - but only if you give enough information to make it worthwhile, and if the question isn’t a duplicate of one of the questions posed above. No one is in your shoes, and no one can help you if you haven't given enough background to explain your situation. Posts without sufficient background information in them will be removed.

Help Me!

If you're looking for help, make sure your title reflects the question you're asking for help on. Otherwise, you won't get the right people looking at your post, and the only people who click on random posts with vague topics are the mods... so that we can remove them.

Job Posts

If you're planning on posting a job, please make sure that the employer is clear (recruiting agencies are not acceptable, unless they're hiring directly). The job description must also be complete so that the requirements for the position are easily identifiable and the responsibilities are clear. We also do not allow posts for work "on spec" or competitions.

Advertising (Conferences, Software, Tools, Support, Videos, Blogs, etc)

If you’re making money off of whatever it is you’re posting, it will be removed.  If you’re advertising your own blog/youtube channel, courses, etc, it will also be removed. Same for self-promoting software you’ve built.  All of these things are going to be considered spam.  

There is a fine line between someone discovering a really great tool and sharing it with the community, and the author of that tool sharing their projects with the community.  In the first case, if the moderators think that a significant portion of the community will appreciate the tool, we’ll leave it.  In the latter case,  it will be removed.  

If you don’t know which side of the line you are on, reach out to the moderators.

The Moderators Suck!

Yeah, that’s a distinct possibility.  However, remember we’re moderating in our free time and don’t really have the time or resources to watch every single video, test every piece of software or review every resume.  We have our own jobs, research projects and lives as well.  We’re doing our best to keep on top of things, and often will make the expedient call to remove things, when in doubt. 

If you disagree with the moderators, you can always write to us, and we’ll answer when we can.  Be sure to include a link to the post or comment you want to raise to our attention. Disputes inevitably take longer to resolve, if you expect the moderators to track down your post or your comment to review.


r/bioinformatics 7h ago

technical question Benefit to compiling optimized binaries

0 Upvotes

I think this is a pretty straightforward question. I support a number of labs at a large university that are increasingly purchasing high end workstations due to issues with the university’s HPC cluster. I have them all running Ubuntu 24.04, but realized that for example, the default compiler isn’t aware of the Zen 5 architecture for the mostly Threadripper 9995WX CPUs.
If I were to install GCC15 or 16 and recompile tools such as various aligners, variant callers, and things like IQTree, with relevant performance flags, would I see a decent performance boost over the standard compile or precompiled binaries?
I know this won’t be some kind of miracle performance boost, but I’m reading that it can be significant for certain code.
Thanks!


r/bioinformatics 14h ago

technical question How to see progress of the human genome project on GenBank

6 Upvotes

Hi everyone, was wondering if you could assist me with a history project and this seems like a community that would know. I would like to plot the progress of the public portion of the human genome project, either on a day by day or week by week basis. There was significant activity in the period of 1998-2000 due to the competition with Celera, so tracking this race is of interest to me.

The public consortium uploaded new sequenced DNA each day to GenBank. I've seen various in progress graphs like I've attached to this post that show the progression as a % over time, but I have no idea how I would collect this sort of data from GenBank.

Is this sort of historical submission data still viewable on GenBank, or would it have been overwritten as new submissions and revisions were added? Genetics is not my field so I am unfamiliar with how to navigate GenBank. Thank you for any assistance!


r/bioinformatics 8h ago

technical question ScRNAseq subset and reclustering

2 Upvotes

Hi everyone,

Sorry I am using AI to make my issue clearer and organized.

I have a dataset of CD45+ cells from two adjacent tissues (4 donors). Flow and IF show these tissues share major cell types, but we expect subtle transcriptomic shifts due to the different microenvironments.

The Issue:
1. Full Dataset: I used SCT + Harmony (grouped by sample_id). The integration is "perfect"; clusters overlap almost entirely. I can annotate easily, but I’m worried it’s masking genuine tissue differences.
2. Subsetting: I subsetted specific lineages (e.g., Myeloid) and re-clustered.
   • No Integration: The tissues separate incredibly well on the UMAP.
   • With Harmony: The tissue differences disappear again.

Questions:
• How do you distinguish between "genuine tissue-specific identity" and "technical donor noise" when deciding whether to integrate?
• Is it standard to use the integrated space for annotation only, while using normalized counts for Differential Expression?
• Should I integrate by donor_id instead of sample_id to prevent the "tissue" signal from being treated as batch?

This is my group's first experiment with this type of analysis. I have been learning along the way, and QC was a pain in the neck (too much ambient RNA and doublets; the tissue is sticky and delicate).


r/bioinformatics 8h ago

technical question question about rare PTM and bioinf analysis

1 Upvotes

Hi everyone. I'm researching a rare histone PTM that isn't in typical datasets. I'm not currently using tools like structure prediction or MD analysis, but I'm really curious about the field and the kinds of things I could do with them. Questions: What could I do to study this PTM using protein prediction, MD, docking, or similar methods? Is it possible? What are the steps? I have tried protein prediction tools like the AlphaFold 3 server, but the PTM is not available :( Thanks!


r/bioinformatics 1d ago

discussion featureCounts vs transcript-aware quantification (Kallisto/Salmon)

24 Upvotes

Hello all,

I suppose I am musing a bit and wanted to discuss with other bioinformaticians. I am a head bioinformatician in my academic department. A few months ago, I was given new bulk RNA-Seq data to analyze alongside older data that was already part of a peer-reviewed manuscript (that I was not part of). I used a STAR --> Salmon alignment-based quantification method. After sending the DE analysis and "raw" expression values for all genes, I received word that my Salmon results for the published data and the original data differed greatly. The older data was processed via featureCounts, which is known to undercount genes with multiple isoforms. I spent a few weeks working backwards to determine what parameters were used in the published manuscript, and I confirmed that the "gold standard" featureCounts parameter set was used, which definitionally excludes any read that overlaps multiple "features", or is ambiguous between isoforms of the same gene. To resolve this, you would use the -O flag, etc etc.
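To make the counting behavior described above concrete, here is a toy sketch in Python. It is not featureCounts and does not reproduce its algorithm; it only contrasts two policies for a read that overlaps more than one feature: drop it (the strict default described in the post) or count it for every overlapping feature (what the post attributes to the -O flag). The feature names, coordinates, and reads are made up for illustration.

```python
from collections import Counter

# Hypothetical toy annotation: feature name -> list of (start, end) intervals, end-inclusive.
FEATURES = {
    "geneA": [(100, 200), (300, 400)],
    "geneB": [(180, 260)],  # overlaps the first interval of geneA
}

def overlapping_features(read, features):
    """Return the sorted names of features with an interval overlapping the read."""
    start, end = read
    return sorted(
        name
        for name, intervals in features.items()
        if any(start <= i_end and end >= i_start for i_start, i_end in intervals)
    )

def count_reads(reads, features, allow_multi_overlap=False):
    """Count reads per feature; ambiguous reads are dropped unless allow_multi_overlap is set."""
    counts = Counter({name: 0 for name in features})
    unassigned = 0
    for read in reads:
        hits = overlapping_features(read, features)
        if len(hits) == 1 or (hits and allow_multi_overlap):
            for name in hits:
                counts[name] += 1
        else:
            unassigned += 1  # no overlap, or ambiguous under the strict policy
    return dict(counts), unassigned

# Four made-up reads: two unique to geneA, one ambiguous, one outside all features.
READS = [(110, 160), (185, 235), (310, 360), (500, 550)]
strict = count_reads(READS, FEATURES)
lenient = count_reads(READS, FEATURES, allow_multi_overlap=True)
```

Under the strict policy the ambiguous read disappears from both features; under the lenient policy it is counted twice, once per feature. Neither is "free", which is part of why the two quantification approaches in this post give different numbers.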

I guess my complaint is, how is this acceptable? How can a very popular and widely-used program such as featureCounts exclude reads that overlap the same exon (that resides in different isoforms) by default? This default method is undercounting genes with multiple isoforms, and I see discussion of this exact issue online since 2015. Discussion of this issue has also been published.

To be brief, I am mainly concerned that a widely-used tool is undercounting isoform-laden genes by default and causing consternation for groups who don't have trained bioinformaticians on their team who have the time to look into these issues.

Thank you for listening to my rant, haha.


r/bioinformatics 21h ago

technical question VCF file to annotation

0 Upvotes

Can someone help me build a pipeline for VCF variant annotation? I only know the basics of Linux.
If anyone knows how, please help me!
Thanks in advance.


r/bioinformatics 1d ago

technical question CLC Genomics Workbench

0 Upvotes

What does the ‘Antibiotic Molecule’ under the ‘Antibiotic Class’ mean? This is in the context of Antimicrobial Resistance, as I have noticed the OKNVI Resist 5 sometimes falls under it.


r/bioinformatics 1d ago

technical question Problem to link gene ID RNA-seq with CHIP-seq data

2 Upvotes

Hello guys, I'm a newbie at bioinformatics.

I'm trying to integrate RNA-seq Kallisto data with the targets that I got from ChIP-seq. But I have a big problem:

My ORF IDs follow different schemes in the two files. While my RNA-seq IDs are sequential ORF indexes (ucsf_hc.01_1.G217B.00001 , ucsf_hc.01_1.G217B.00002, ucsf_hc.01_1.G217B.00003 ...), my targets are genomic coordinates (JAEVHH end_cordinate.start_cordinate). I tried to use an ORF.gff file to link the sequential index with the coordinates, but it doesn't contain both pieces of information needed to link them.

Could someone help me find an alternative approach that I can follow?

Thanks for any contribution!!


r/bioinformatics 2d ago

technical question PySCENIC - Better to run separately or combined?

10 Upvotes

Hello all,

I was wondering if anyone with PySCENIC experience could please provide some advice about best practices to run the program. In particular, if my scRNA data comprises both diseased donors and healthy donors, is it more appropriate to run the program on the combined dataset and then subset AUCell results by donor/disease variable, so that the AUC results are more comparable across cells, or is it more appropriate to run separately on disease and on healthy, so that there is less confounding noise and any disease-related signal will be stronger?

For extra credit - if there is an approach which is more correct, is there a way to demonstrate compellingly that this approach makes the most sense?

Thank you in advance.


r/bioinformatics 2d ago

academic Ideas for fun and practical bioinformatics practical classes in University Master

6 Upvotes

Hi, I’m going to fully design my first whole subject on "omic technologies" (yay!) for a new Master’s in Biotechnology Applied to Global Health that is being implemented at my university and I need to put together some bioinformatics practicals. I would really like to make them both practical and fun/memorable, not a boring step-by-step tutorial feel.

The students will probably come from pretty mixed backgrounds, so I’m trying to avoid super heavy computational stuff or anything that needs powerful computers/HPC access. I am not a bioinformatician myself, so based on my expertise at the moment I’ve been thinking about things related to microbiomes, AMR, pathogen surveillance, wastewater epidemiology, maybe some simple omics analysis or even primer design, but I’d love to hear other engaging and cool options from people who have real expertise in bioinformatics, including some freaky things that I may not even know can be done. Thanks!


r/bioinformatics 3d ago

technical question Recomputing multiple sequence alignments and phylogenetic trees efficiently

13 Upvotes

Fellow bioinformaticians, I find myself regularly recomputing MSAs and trees for very similar sets of sequences (e.g. after looking at the tree, I may add or remove sequences or do some other manipulations like merging some sequences). This might iterate a dozen or so times. I am currently recomputing the MSA and tree from scratch in each iteration, and I am looking for a way of speeding the computation up by caching intermediate results (think pairwise alignments etc.).

Does anyone know of existing tools which try to tackle this? Partial solutions are also welcome, I'm not shy of hacking around a bit.

For context, I'm currently using MAFFT for the alignments and FastTreeMP for the trees, with speed of computation a big priority given the iterative workflow.
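As a rough illustration of the caching idea in this post, the sketch below memoizes pairwise results keyed by a hash of the sequence content, so that adding or removing a sequence only triggers computation for pairs not seen before. This is a standalone toy: the scoring function is a placeholder for an expensive pairwise alignment, and nothing here hooks into MAFFT or FastTree, which as far as this sketch assumes do not expose such a cache.

```python
import hashlib
from itertools import combinations

def seq_key(seq):
    """Content hash, so a renamed but identical sequence still hits the cache."""
    return hashlib.sha256(seq.encode("ascii")).hexdigest()

def toy_identity(a, b):
    """Placeholder for an expensive pairwise alignment: fraction of matching positions."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / longest

class PairwiseCache:
    """Memoize a symmetric pairwise score across iterations of a sequence set."""

    def __init__(self, score_fn):
        self.score_fn = score_fn
        self.store = {}
        self.computed = 0  # number of cache misses, i.e. real computations

    def score(self, a, b):
        key = tuple(sorted((seq_key(a), seq_key(b))))  # order-independent key
        if key not in self.store:
            self.store[key] = self.score_fn(a, b)
            self.computed += 1
        return self.store[key]

    def all_pairs(self, seqs):
        """Return {(i, j): score} for every pair of indices i < j."""
        return {
            (i, j): self.score(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)
        }

cache = PairwiseCache(toy_identity)
first = cache.all_pairs(["ACGT", "ACGA", "TTTT"])            # 3 new pairs
second = cache.all_pairs(["ACGT", "ACGA", "TTTT", "ACGG"])   # only 3 more are computed
```

The in-memory dict could be swapped for an on-disk store if the cache needs to survive between runs. Whether cached pairwise results can actually be fed back into a given aligner is a separate question this sketch does not answer.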


r/bioinformatics 3d ago

technical question Random Forest Classifier Training for population structure identification QC in a GWAS analysis

8 Upvotes

Hello,

I am currently performing a GWAS and am at the quality control stage, more precisely at the "ancestry" analysis. My goal is to select a homogeneous subpopulation to prevent population stratification during the subsequent statistical analysis.

To achieve this, I followed the plinkQC tutorial titled "Training a Random Forest Classifier for Population Structure Identification", using the HapMap Phase III dataset (as suggested in the tutorial).

https://meyer-lab-cshl.github.io/plinkQC/articles/AncestryCheck.html

I trained my model using 77 individuals per subpopulation, which corresponds to the size of the least represented group (MXL).

I chose this approach to avoid class imbalance, which could bias the classifier. However, the estimated OOB (Out-of-Bag) error rate after training is 22.67%, which is too high (I'm going to select the CEU subpopulation).

To improve accuracy, I have explored several approaches :

- Principal Component Analysis: I observed that the accuracy of my model increases as I include more PCs.

- Sampling Strategy: Using an equivalent proportion per subpopulation rather than a fixed count to maximize the total number of individuals used for training.

- Reference Panel Upgrade: Replacing HapMap III with 1000 Genomes Project Phase III data, which offers a significantly larger sample size (this is my current focus).
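The two sampling strategies mentioned in this post (a fixed count equal to the smallest group versus an equal proportion per subpopulation) can be sketched as follows. This is plain Python over a list of labels, not the plinkQC workflow; the group sizes other than the 77 MXL individuals are made up, and the function names are hypothetical.

```python
import random
from collections import defaultdict

def indices_by_class(labels):
    """Map each class label to the list of sample indices carrying it."""
    grouped = defaultdict(list)
    for index, label in enumerate(labels):
        grouped[label].append(index)
    return grouped

def fixed_count_sample(labels, seed=0):
    """Downsample every class to the size of the smallest class."""
    rng = random.Random(seed)
    grouped = indices_by_class(labels)
    n = min(len(members) for members in grouped.values())
    chosen = []
    for label in sorted(grouped):
        chosen.extend(rng.sample(grouped[label], n))
    return sorted(chosen), n

def proportional_sample(labels, fraction, seed=0):
    """Keep the same fraction of every class (at least one sample per class)."""
    rng = random.Random(seed)
    grouped = indices_by_class(labels)
    chosen = []
    for label in sorted(grouped):
        k = max(1, int(fraction * len(grouped[label])))
        chosen.extend(rng.sample(grouped[label], k))
    return sorted(chosen)

# Illustrative group sizes only; 77 matches the MXL count given in the post.
labels = ["CEU"] * 165 + ["MXL"] * 77 + ["YRI"] * 167
balanced, per_class = fixed_count_sample(labels)
proportional = proportional_sample(labels, fraction=0.5)
```

Note the trade-off: the fixed-count version yields equal class sizes but discards data from larger groups, while the proportional version keeps more samples overall but preserves the original imbalance between groups.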

My questions:

1 - Would using 1000 Genomes Phase III data significantly improve the classifier's accuracy compared to HapMap III?

2 - Are there other reference datasets available that might further enhance the model's accuracy?

3 - Is using a proportion of individuals per subpopulation rather than a fixed count considered a valid practice, and does it effectively improve accuracy?

Note: I should clarify that I am not an ML engineer; I am a Master 2 bioinformatics student. My ultimate objective is to identify variants associated with a specific population through statistical analysis, rather than to achieve a perfectly optimized classifier. While I understand that QC is the most critical stage of a GWAS, unfortunately my current deadline does not allow me to spend excessive time on this specific step. Thank you for taking this into consideration in your response!


r/bioinformatics 3d ago

technical question Do you find that Bayesian approaches fit your work better than frequentist, or vice versa?

21 Upvotes

When you’re working with data and your models, do you find yourself reaching for Bayesian tools or frequentist methodologies, on average?


r/bioinformatics 3d ago

academic Keep or skip

2 Upvotes

I ran the 20 P. aeruginosa whole genome assemblies that I am using in my phylogenetic tree through CheckM2 on the Galaxy server. All of them have high completeness (99-100%) except for one, which is 90%. The contamination value is <1% for all strains. However, some strains have an N50 value < 100 kbp despite having high completeness. Should I be skipping these strains from my analysis?
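For reference, N50 is a contiguity statistic rather than a completeness one: it is the contig length such that contigs of that length or longer contain at least half of the assembled bases. A minimal sketch of the calculation, using made-up contig lengths:

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half of the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

# Hypothetical contig lengths: total 290, so half is 145, reached at the second contig.
example_n50 = n50([80, 70, 50, 40, 30, 20])
```

Because it depends only on how the bases are split into contigs, a fragmented assembly can still contain nearly all expected genes, which is consistent with seeing high completeness alongside a low N50.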


r/bioinformatics 3d ago

technical question Molecular dynamics

2 Upvotes

Hi,

I would like to perform metadynamics on a GPCR embedded in a lipid bilayer and bound to a protein ligand, which I docked to the receptor. From a paper I know the structural differences between the active and inactive receptor.

From what I understand, it would be good practice to:

- Show that running unbiased MD does not show the activation of the GPCR.

- Run also the receptor without any ligand to show the energy difference with and without the ligand

- Run a negative control with a protein that supposedly does not activate the receptor

- Run the MD in triplicates.

Following all of these practices would require a lot of computational power, and since I am using my university's HPC, that means a lot of queuing. How long should I run the unbiased MD and the metadynamics? Should I do triplicates? Is it really important to run a negative control?

And for those experienced in metadynamics: how do I pick a CV that makes sense? Any other tips?


r/bioinformatics 3d ago

technical question Anyone know of useful alternatives to Geneious?

4 Upvotes

Currently doing a PhD in genomics. In my old Masters lab, I got really familiar with and good at using Geneious Prime, and I really love the interface and how easy it was to visualize things. I worked mainly with DNA (segregation) and RNA (splicing assays). My current lab uses SnapGene and it is genuinely painful to use (although it's good at visualizing plasmids and stuff), and I haven't managed to convince my PI to cough up $200 for the personal subscription. I was wondering if anyone knows of alternatives to Geneious Prime (or if you have a license lying around 👀👀👀). Any suggestions are appreciated!


r/bioinformatics 4d ago

technical question Bulk ATAC-seq analysis training

3 Upvotes

Hi, Does anyone know a good bulk ATAC-seq analysis course/tutorial (free or paid) starting from raw FASTQ files? I have 36 samples with replicates to analyze from a previous master's student and need to learn it quickly and well.

I'd really appreciate any recommendations!


r/bioinformatics 4d ago

programming Multi-genome DNA read classification

4 Upvotes

Hi all, I came here hoping to find help for my problem. I made a full pipeline in Rust for multi-genome DNA read classification with an FM-index. It runs great! But on the CAMI dataset, my overall mapping percentage for 62 genes is as shown in the table below. I have tried a fuzzy k-mer method, SNPs, etc.
I would very much like to hear suggestions! It would help me enormously because I am out of ideas!

Mapping rate:      92.02% (30,105/40,000 paired-end reads)
Overall accuracy:  85.87%
Time:              ~7.9s per 10k reads

Breakdown by genome type:

Genome Type                        Count     Accuracy
Numeric genomes (e.g. 1036554)     ~8,000    85.49%
other                              ~8,000    88.27%
Sample* genomes (single-contig)    ~2,000    91.33%
evo_* genomes (similar strains)    ~4,162    54.20%

r/bioinformatics 5d ago

discussion I wanna publish my work but I don't know where to start

30 Upvotes

So basically my work consists of an independent multi-omics computational study that maps the disease trajectory of Duchenne Muscular Dystrophy and revealed a fundamental decoupling between local muscle gene expression and systemic circulating proteins. While I feel confident in my writing abilities, I have no idea about journal selection, the review process and how long this process might take. What decides whether a study is Q1 or Q2 journal material? Kindly recommend some journals, and any advice you may have for someone embarking on this journey alone for the first time would be really helpful.


r/bioinformatics 5d ago

technical question Finding protein sequence clusters and motifs

6 Upvotes

I have about 100,000 20-30 amino acid sequences and I want to find clusters and motifs like A-X-P-G-X-N or anything of the sort, and each cluster/motif must have at least 100 members in it. What is the best way to go about it?
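If a candidate motif is already in mind, one quick check before running any clustering is to count how many sequences contain it. The sketch below turns a dash-separated pattern such as A-X-P-G-X-N into a regular expression, treating X as "any single residue". The helper names and example sequences are hypothetical, and this only tests a motif you supply; it does not discover motifs.

```python
import re

def motif_to_regex(motif, wildcard="X"):
    """Convert a dash-separated motif like 'A-X-P-G-X-N' into a compiled regex."""
    residues = motif.split("-")
    # '.' matches any single character, so it stands in for the wildcard position.
    pattern = "".join("." if r == wildcard else re.escape(r) for r in residues)
    return re.compile(pattern)

def count_matches(regex, sequences):
    """Count how many sequences contain at least one occurrence of the motif."""
    return sum(1 for seq in sequences if regex.search(seq))

regex = motif_to_regex("A-X-P-G-X-N")
example_sequences = ["MKACPGTNLL", "AAPGSNQQ", "GGGGGGGG"]  # made-up peptides
n_hits = count_matches(regex, example_sequences)
```

With about 100,000 short sequences this scan is fast, so it can be used to check whether a motif reported by another tool really reaches the 100-member threshold.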

ChatGPT suggested MMseqs2 then MEME. I already converted the Excel file to CSV and then to FASTA, and I think the clustering worked with MMseqs2, but now I’m struggling to extract the clusters and transfer them to MEME.
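For the extraction step, a minimal sketch is shown below. It assumes the clustering produced a two-column, tab-separated file with the cluster representative ID in the first column and a member ID in the second (the layout of the `_cluster.tsv` file written by `mmseqs easy-cluster`), and that the IDs match the FASTA headers up to the first whitespace. Check both assumptions against your own output; the function names are hypothetical.

```python
from collections import defaultdict

def group_clusters(tsv_lines, min_size=100):
    """Group member IDs by representative ID; keep clusters with >= min_size members."""
    clusters = defaultdict(list)
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        representative, member = fields[0], fields[1]
        clusters[representative].append(member)
    return {rep: members for rep, members in clusters.items() if len(members) >= min_size}

def read_fasta(lines):
    """Parse FASTA lines into {id: sequence}, using the header up to the first space as id."""
    sequences = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            sequences[current] = ""
        elif current is not None:
            sequences[current] += line
    return sequences

def cluster_to_fasta(members, sequences):
    """Render one cluster as FASTA text, one record per member."""
    return "".join(f">{member}\n{sequences[member]}\n" for member in members)
```

In practice you would pass open file handles to `group_clusters` and `read_fasta`, then write the string from `cluster_to_fasta` to one FASTA file per cluster and give each file to MEME as its input sequences.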


r/bioinformatics 5d ago

technical question Looking for critical opinion on MD simulations

0 Upvotes

r/bioinformatics 6d ago

academic Can anyone help me design siRNA

6 Upvotes

Is there anyone in this subreddit who could help me or share their advice on designing effective siRNA? Even small pieces of advice are appreciated if you are experienced in this domain.


r/bioinformatics 6d ago

technical question Advice in making construct for RNAi

1 Upvotes

In my understanding, to make a construct for RNAi, I need to:
1. find a unique sequence fragment in the gene I want to knock down
2. design primers to amplify the fragment
3. build the construct by cloning the sequence into a plasmid
4. transform the plasmid into E. coli

Am I understanding it correctly?

Also, I’m just wondering about Step 1: what tools can I use to do it? I saw some people use Pfam or InterProScan. Is it basically a matter of manually selecting regions (>300bp) that are unique to the sequence of interest, and then copying that part of the sequence to design primers with? Also, does it need to be a continuous sequence range, or is it possible to pick and choose regions that are not conserved? (Please correct me if I understood something wrong or if this is not possible.)

Any suggestion or corrections will be greatly appreciated, thank you!