2024-04-26 08:54:47 +02:00
The task is to read a FASTA file and count how many A, C, T, and G letters occur in an example sequence.

FASTA files are easy to find on the internet. For example, the Wikipedia page on the format provides a sequence we can use:

https://en.wikipedia.org/wiki/FASTA_format

A FASTA file with one sequence, here from an elephant, looks like this:

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

The FASTA format has two separate parts:

1. header (the line that starts with '>')
2. sequence (the lines that follow)

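The two parts can be told apart by looking at the first character of each line. Below is a minimal sketch of that idea in C. It assumes the line has already been read into a null-terminated string; the names fasta_part and classify_line are hypothetical and chosen only for illustration.

```c
/* Hypothetical names, used for illustration only. */
enum fasta_part { FASTA_HEADER, FASTA_SEQUENCE };

/* Classify one line of a FASTA file by its first character:
   '>' marks the header, anything else is treated as sequence data.
   Assumes line is a valid, null-terminated string. */
enum fasta_part classify_line(const char *line)
{
    return (line[0] == '>') ? FASTA_HEADER : FASTA_SEQUENCE;
}
```

Applied to the elephant example, the first line would be classified as the header and the remaining five lines as sequence.
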
2024-04-26 09:21:33 +02:00
Thus, the exercise of reading a FASTA file in C briefly looks like this:

1. Select a sequence and save it to a file, e.g. seq.fasta

2. Open the file seq.fasta in C with fopen:

https://cplusplus.com/reference/cstdio/fopen/

Tip: Use the correct mode; the file only needs to be read, so "r" is enough.

3. Read the file line by line, e.g. using the code snippet from here:

https://cplusplus.com/forum/beginner/22558/

4. Check whether the line starts with ';' or '>'; if it does, skip it.

If the line is OK, count how many A, C, T, and G letters it contains and
save that information.

To check whether a line starts with a given character, you can simply
index the string: line[0] is the first character in the array line.

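Steps 2 to 4 can be sketched as one function. This is only one possible solution under stated assumptions, not the required one: the names base_counts, count_line and count_fasta_file are hypothetical, lowercase letters are counted as well as uppercase, and the 256-byte buffer size is arbitrary. Because fgets may return a long line in several chunks, the sketch checks line[0] only when a chunk really begins a new line.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the four counters. */
struct base_counts {
    long a, c, g, t;
};

/* Add the A, C, G and T letters found in one chunk of text to out.
   Assumption: lowercase letters count too, so each char is upper-cased. */
static void count_line(const char *line, struct base_counts *out)
{
    for (const char *p = line; *p != '\0'; ++p) {
        switch (toupper((unsigned char)*p)) {
        case 'A': out->a++; break;
        case 'C': out->c++; break;
        case 'G': out->g++; break;
        case 'T': out->t++; break;
        default: break; /* any other character is ignored */
        }
    }
}

/* Count the letters in every sequence line of the file at path.
   Returns 0 on success, -1 if the file cannot be opened. */
int count_fasta_file(const char *path, struct base_counts *out)
{
    char line[256];        /* arbitrary chunk size */
    int at_line_start = 1; /* does the next chunk begin a new line? */
    int skipping = 0;      /* are we inside a '>' or ';' line? */

    FILE *fp = fopen(path, "r"); /* "r": open for reading only */
    if (fp == NULL)
        return -1;

    out->a = out->c = out->g = out->t = 0;

    while (fgets(line, sizeof line, fp) != NULL) {
        if (at_line_start)
            skipping = (line[0] == '>' || line[0] == ';');
        if (!skipping)
            count_line(line, out);
        /* No '\n' in the chunk means the same line continues
           in the next fgets call. */
        at_line_start = (strchr(line, '\n') != NULL);
    }

    fclose(fp);
    return 0;
}
```

Calling count_fasta_file("seq.fasta", &counts) from main and printing the four fields with printf completes the exercise. Note that the Wikipedia example above is a protein sequence, so there A, C, G and T are amino-acid codes rather than nucleotides; the counting itself works the same way.
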
The end.