bioinf-2023-2024-introducti.../ex1-reading-fasta/description

The task is to read a FASTA file and count how many times each of the letters A, C, T and G occurs in an example sequence.

FASTA files are easy to find online. For example, the Wikipedia page on the FASTA format contains a sample sequence:

https://en.wikipedia.org/wiki/FASTA_format

A FASTA file with a single sequence, here from an elephant, looks like this:

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY


The FASTA format has two separate parts:

1. header
2. sequence
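
In the example above the header is the single line that starts with '>',
and the sequence is everything on the lines that follow. A minimal sketch
of how the two parts could be told apart in C is shown here. The function
names is_header_line and header_text are made up for this illustration and
are not part of the exercise.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (name chosen for this sketch): returns true when
   `line` is a FASTA header, i.e. its first character is '>'. */
bool is_header_line(const char *line)
{
    return line != NULL && line[0] == '>';
}

/* Hypothetical helper: returns a pointer to the description text that
   follows the '>' marker, or NULL when `line` is not a header. */
const char *header_text(const char *line)
{
    return is_header_line(line) ? line + 1 : NULL;
}
```

For the elephant example, is_header_line is true for the first line and
false for the lines that start with LCLYTH... and IENY.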


Thus, the exercise of reading a FASTA file in C briefly looks like this:

1. Select a sequence and save it to a file, e.g. seq.fasta

2. Open the file seq.fasta in C with fopen:

https://cplusplus.com/reference/cstdio/fopen/

Tip: Use the correct mode.

3. Read the file line by line, using e.g. the snippet of code from here:

https://cplusplus.com/forum/beginner/22558/

4. Check whether the line starts with ';' or with '>'. If it does, skip it.

If the line is OK, count how many A, C, T, G are in the given line and
save that information.

To check whether a line starts with a given character, you can simply
index the string: line[0] is the first character in the array line.

The end.