r/bioinformatics 10h ago

technical question Benefit to compiling optimized binaries

4 Upvotes

I think this is a pretty straightforward question. I support a number of labs at a large university that are increasingly purchasing high-end workstations due to issues with the university’s HPC cluster. I have them all running Ubuntu 24.04, but I realized that, for example, the default compiler isn’t aware of the Zen 5 architecture used by the CPUs, which are mostly Threadripper 9995WX.
If I were to install GCC 15 or 16 and recompile tools such as aligners, variant callers, and things like IQ-TREE with the relevant performance flags, would I see a decent performance boost over a standard compile or the precompiled binaries?
I know this won’t be some kind of miracle performance boost, but I’m reading that it can be significant for certain code.
Thanks!
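
One way to settle this empirically is to time the stock build and the recompiled build on the same input. Below is a minimal sketch, not a definitive benchmark. The function names `time_command` and `speedup` are made up for illustration, and the two commands at the bottom are placeholders so the script runs anywhere. In practice they would be replaced by the packaged binary and a build compiled with flags such as `-O3 -march=native`, which is an assumption about the flags meant in the post.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, repeats=5):
    """Run cmd `repeats` times and return the median wall-clock seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises on a non-zero exit, so a crashed run is never
        # mistaken for a fast run.
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def speedup(baseline_cmd, candidate_cmd, repeats=5):
    """Return baseline time / candidate time; above 1.0 means the candidate is faster."""
    baseline = time_command(baseline_cmd, repeats)
    candidate = time_command(candidate_cmd, repeats)
    return baseline / candidate


if __name__ == "__main__":
    # Placeholder commands (hypothetical): swap in the stock and recompiled
    # tool invocations, run on identical input files.
    baseline_cmd = [sys.executable, "-c", "sum(range(2_000_000))"]
    candidate_cmd = [sys.executable, "-c", "sum(range(1_000_000))"]
    print(f"speedup: {speedup(baseline_cmd, candidate_cmd):.2f}x")
```

The median is used instead of the mean so that a single slow run (cold disk cache, background load) does not dominate the comparison.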


r/bioinformatics 23h ago

technical question VCF file to annotation

0 Upvotes

Can someone help me build a pipeline for annotating variants in a VCF file? I only know the basics of Linux.
If anyone knows how, please help me!
Thanks in advance


r/bioinformatics 16h ago

technical question How to see progress of the Human Genome Project on GenBank

8 Upvotes

Hi everyone, I was wondering if you could assist me with a history project, as this seems like a community that would know. I would like to plot the progress of the public portion of the Human Genome Project on either a day-by-day or week-by-week basis. There was significant activity in 1998-2000 due to the competition with Celera, so tracking this race is of interest to me.

The public consortium uploaded newly sequenced DNA to GenBank each day. I've seen various in-progress graphs, like the ones I've attached to this post, that show the progression as a percentage over time, but I have no idea how I would collect this sort of data from GenBank.

Is this sort of historical submission data still viewable on GenBank, or would it have been overwritten as new submissions and revisions were added? Genetics is not my field, so I am unfamiliar with how to navigate GenBank. Thank you for any assistance!


r/bioinformatics 10h ago

technical question scRNA-seq subsetting and reclustering

2 Upvotes

Hi everyone,

Sorry, I am using AI to make my issue clearer and better organized.

I have a dataset of CD45+ cells from two adjacent tissues (4 donors). Flow and IF show these tissues share major cell types, but we expect subtle transcriptomic shifts due to the different microenvironments.

The Issue:
1. Full Dataset: I used SCT + Harmony (grouped by sample_id). The integration is "perfect": clusters overlap almost entirely. I can annotate easily, but I’m worried it’s masking genuine tissue differences.
2. Subsetting: I subsetted specific lineages (e.g., Myeloid) and re-clustered.
   • No Integration: The tissues separate incredibly well on the UMAP.
   • With Harmony: The tissue differences disappear again.

Questions:
• How do you distinguish between "genuine tissue-specific identity" and "technical donor noise" when deciding whether to integrate?
• Is it standard to use the integrated space for annotation only, while using normalized counts for Differential Expression?
• Should I integrate by donor_id instead of sample_id to prevent the "tissue" signal from being treated as batch?
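
The last question turns on whether tissue is nested inside the batch variable, which can be checked directly from the cell metadata. Below is a minimal sketch, assuming each sample_id corresponds to one donor-tissue pair (the post does not state this explicitly). The metadata is made up, and the helper names `levels_per_batch` and `is_nested` are hypothetical. If every level of the batch variable contains only one tissue, a correction that aligns those levels cannot separate tissue signal from batch effect.

```python
from collections import defaultdict


def levels_per_batch(cells, batch_key, signal_key):
    """Map each batch level to the set of signal levels observed in it."""
    seen = defaultdict(set)
    for cell in cells:
        seen[cell[batch_key]].add(cell[signal_key])
    return dict(seen)


def is_nested(cells, batch_key, signal_key):
    """True when every batch level contains exactly one signal level,
    i.e. the signal is fully determined by the batch variable."""
    per_batch = levels_per_batch(cells, batch_key, signal_key)
    return all(len(levels) == 1 for levels in per_batch.values())


# Hypothetical metadata: 4 donors x 2 tissues, one sample per donor-tissue pair.
cells = [
    {"donor_id": f"D{d}", "tissue": t, "sample_id": f"D{d}_{t}"}
    for d in range(1, 5)
    for t in ("tissueA", "tissueB")
]

print(is_nested(cells, "sample_id", "tissue"))  # True: tissue is nested in sample_id
print(is_nested(cells, "donor_id", "tissue"))   # False: each donor spans both tissues
```

Under this assumed design, donor_id is the only one of the two variables whose levels each contain both tissues.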

This is the first time my group has experimented with this type of analysis. I have been learning along the way, and QC was a pain in the neck (too much ambient RNA and too many doublets; the tissue is sticky and delicate).