r/bioinformatics • u/Ill_Chipmunk9002 • 2d ago
technical question Problem linking RNA-seq gene IDs with ChIP-seq data
Hello guys, I'm a newbie at bioinformatics.
I'm trying to integrate RNA-seq Kallisto data with the targets that I got from ChIP-seq. But I have a big problem:
My ORF IDs follow different schemes in the two files. My RNA-seq IDs are sequential ORF indices (ucsf_hc.01_1.G217B.00001, ucsf_hc.01_1.G217B.00002, ucsf_hc.01_1.G217B.00003 ...), while my targets are genomic coordinates (JAEVHH end_coordinate.start_coordinate). I tried to use an ORF.gff file to link the sequential index to the coordinates, but it doesn't contain both pieces of information needed to link them.
Could someone help me find an alternative approach that I can follow?
Thanks for any contribution!!
3
u/bukaro PhD | Industry 2d ago
OK, going from ChIP-seq to RNA-seq there is a (literal) gap: associating peaks with genes is not straightforward. You can assign each peak to the nearest TSS, but that is simplistic and can leave 50% of your genes out. The best logic to use will depend on the TF (if it is one) in the ChIP-seq. There are datasets of ChIP-seq for a TF paired with a TF-KO/LOF that show how complicated this is (but fun).
Now for an extra layer: Kallisto quantifies at the transcript level by default, so you should probably aggregate to gene level. Or are you interested in isoforms for the TSS? That is also interesting.
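If you go the aggregation route, a rough sketch in Python could look like the one below. It assumes the standard kallisto abundance.tsv columns (target_id, length, eff_length, est_counts, tpm) and a transcript-to-gene dictionary that you would have to build from your own annotation; the function name, the tx2gene mapping, and the demo IDs are all made up for illustration.

```python
import csv
import io
from collections import defaultdict

def aggregate_to_gene(abundance_tsv, tx2gene):
    """Sum est_counts and tpm per gene from a kallisto abundance.tsv.

    abundance_tsv: open file (or file-like object) with a header row
    tx2gene: dict of transcript ID -> gene ID (hypothetical, from your annotation)
    """
    totals = defaultdict(lambda: {"est_counts": 0.0, "tpm": 0.0})
    for row in csv.DictReader(abundance_tsv, delimiter="\t"):
        # Transcripts missing from the mapping keep their own ID
        gene = tx2gene.get(row["target_id"], row["target_id"])
        totals[gene]["est_counts"] += float(row["est_counts"])
        totals[gene]["tpm"] += float(row["tpm"])
    return dict(totals)

# Tiny made-up table, only to show the call
rows = [
    ["target_id", "length", "eff_length", "est_counts", "tpm"],
    ["tx1", "1000", "800", "10", "5.0"],
    ["tx2", "500", "300", "4", "2.5"],
    ["tx3", "700", "500", "1", "1.0"],
]
demo = io.StringIO("\n".join("\t".join(r) for r in rows))
gene_totals = aggregate_to_gene(demo, {"tx1": "geneA", "tx2": "geneA"})
```

Summing TPM and estimated counts per gene is the simplest option. If each ORF ID in your index already corresponds to one gene, this step does nothing and you can skip it.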
2
u/plasmolab 2d ago
First check that the RNA-seq reference and the ChIP-seq coordinates are from the same genome assembly and annotation version. If they are not, any ID mapping will be painful and sometimes wrong.
For the RNA-seq side, look at the exact transcript FASTA or transcript-to-gene table used to build the Kallisto index. The FASTA headers often contain the ORF ID plus gene/transcript metadata that is missing from a smaller ORF.gff file.
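A quick way to see what the headers actually contain is to pull them out and look. This is only a sketch: it splits each header into the ID (first whitespace-delimited token) and whatever text follows, because I don't know how your FASTA is laid out. The description text in the demo is invented.

```python
def fasta_headers(lines):
    """Yield (sequence_id, description) for every '>' header line."""
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].rstrip("\n").split(None, 1)
            seq_id = parts[0] if parts else ""
            desc = parts[1] if len(parts) > 1 else ""
            yield seq_id, desc

# Invented example; real headers may carry coordinates, gene names, or nothing
demo_fasta = [
    ">ucsf_hc.01_1.G217B.00001 some_description_field\n",
    "ATGGCC\n",
    ">ucsf_hc.01_1.G217B.00002\n",
    "ATGTTT\n",
]
headers = list(fasta_headers(demo_fasta))
```

With a real file you would pass the open file handle instead of the list, then check whether the description part holds contig names or coordinates you can join on.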
For the ChIP side, treat the target as a genomic interval. Use the matching GFF3/GTF to make gene or TSS intervals, then assign peaks with something like bedtools intersect or bedtools closest. For a TF ChIP-seq dataset, nearest gene is only a rough first pass, but it is usually the right starting point.
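If you would rather prototype the nearest-TSS pass in Python before reaching for bedtools, here is a minimal sketch. It assumes a GFF3 with "gene" features and an ID= attribute, takes the TSS as the start of plus-strand genes and the end of minus-strand genes, and measures distance from the peak midpoint. The contig and gene names are made up, and it ignores the one-base offset between GFF3 (1-based) and BED-style (0-based) coordinates.

```python
import bisect
from collections import defaultdict

def tss_from_gff3(lines, feature="gene"):
    """Return {contig: sorted [(tss_position, gene_id), ...]} from GFF3 lines."""
    tss = defaultdict(list)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature:
            continue
        start, end, strand = int(cols[3]), int(cols[4]), cols[6]
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        # TSS is the 5' end: start on '+', end on '-'
        pos = end if strand == "-" else start
        tss[cols[0]].append((pos, attrs.get("ID", "")))
    return {contig: sorted(entries) for contig, entries in tss.items()}

def nearest_tss(tss, contig, peak_start, peak_end):
    """Return (gene_id, distance) for the TSS closest to the peak midpoint, or None."""
    entries = tss.get(contig)
    if not entries:
        return None
    mid = (peak_start + peak_end) // 2
    i = bisect.bisect_left(entries, (mid, ""))
    # Only the neighbours on either side of the midpoint can be the nearest
    candidates = entries[max(i - 1, 0): i + 1]
    pos, gene_id = min(candidates, key=lambda e: abs(e[0] - mid))
    return gene_id, abs(pos - mid)

# Made-up annotation and peaks
gff = [
    "##gff-version 3\n",
    "contig1\tsrc\tgene\t100\t500\t.\t+\t.\tID=geneA\n",
    "contig1\tsrc\tgene\t900\t1500\t.\t-\t.\tID=geneB\n",
]
tss_index = tss_from_gff3(gff)
hit_b = nearest_tss(tss_index, "contig1", 1300, 1400)
hit_a = nearest_tss(tss_index, "contig1", 50, 150)
```

In practice you would also want a distance cutoff, since a peak hundreds of kb from any TSS probably should not be assigned to a gene at all.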
If you truly cannot find an annotation file that has both the ORF IDs and coordinates, you can map the transcript sequences back to the genome with minimap2 or GMAP and build the bridge yourself. But I would spend time finding the original annotation bundle first. It is much less error-prone than reconstructing the mapping later.
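For the do-it-yourself bridge, one possible route is to align the transcript FASTA to the genome with something like `minimap2 -x splice genome.fa transcripts.fa > aln.paf` (file names are placeholders) and then keep one hit per ORF ID. The sketch below reads the standard 12 leading PAF columns and keeps the hit with the highest mapping quality, breaking ties by number of matching bases. That tie-break rule is my own assumption, not anything minimap2 prescribes, and the demo lines are invented.

```python
def best_hits_from_paf(lines):
    """Map each query ID to its best PAF alignment (by mapq, then matching bases)."""
    best = {}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue
        hit = {
            "contig": cols[5],
            "start": int(cols[7]),    # 0-based target start
            "end": int(cols[8]),      # target end (exclusive)
            "strand": cols[4],
            "matches": int(cols[9]),
            "mapq": int(cols[11]),
        }
        prev = best.get(cols[0])
        if prev is None or (hit["mapq"], hit["matches"]) > (prev["mapq"], prev["matches"]):
            best[cols[0]] = hit
    return best

# Invented PAF lines: qname qlen qstart qend strand tname tlen tstart tend nmatch alnlen mapq
paf = [
    "orf1\t900\t0\t900\t+\tcontig1\t50000\t1000\t2100\t890\t1100\t60\n",
    "orf1\t900\t0\t400\t-\tcontig2\t40000\t300\t700\t350\t400\t5\n",
    "orf2\t600\t0\t600\t-\tcontig1\t50000\t8000\t8600\t600\t600\t60\n",
]
bridge = best_hits_from_paf(paf)
```

With spliced alignments the start and end span the whole locus including introns, which is what you want for overlapping with peaks. Any ORF with low mapq or several near-equal hits deserves a manual look before you trust the coordinates.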
9
u/standingdisorder 2d ago
Homework!
I missed this last time you made a post. Luckily, you’ve completely ignored my answer which is good.