Everyone today wants to detect disease early on, but because there aren't many patterns for the many diseases available, it's hard to do so. Because DNA sequences contain all the genetic data about organisms, which can be utilised by researchers to discover or treat diseases early on by developing new medications, using DNA sequences to extract patterns of disease can be very advantageous. The largest global collection of genomic sequences is made available by NCBI, but today the biggest worry is how to protect this enormous amount of data. One of the options is to encrypt these genetic sequences using blockchain technology. As a result, a study of the number of studies in this area as well as the demand for blockchain in healthcare has been conducted in this chapter. Additionally, surveys about research done in the field of DNA sequence classification are suggested along with recommendations for using classification of DNA sequences to detect disease earlier.
TopIntroduction
The analysis of DNA sequences is regarded as a crucial component in biological science since it contains genomic information that may be used by scientists and medical professionals to predict diseases before they manifest. The largest global collection of genomic sequences is made available by NCBI, but today the biggest worry is how to protect this enormous amount of data. One of the options is to encrypt these genetic sequences using blockchain technology. Here, a survey has been conducted on the frequency of healthcare data breaches, the demand for blockchain in the industry, and the number of studies conducted in this area.
The graph below displays the total number of publications relating to research on healthcare from 2000 to 2020, and it makes obvious how crucial it is to understand the healthcare system and its security.
Figure 1. A graph on the total number of publications related to studies on healthcare from 2000-to 2020
NCBI which is well known as “Genbank” is the biggest database of genome sequences in the world with billions of nucleoid bases and millions of distinct DNA sequences D. A. Benson et al. (2010). On the other hand, the cost of genome sequencing has fundamentally decreased after a greater advancement in techniques used for reading DNA sequences Ngoc Giang Nguyen et al. (2016), year-wise distribution of sequences storage in Genbank (NCBI) from 2013- to 2022 can be seen in figure 2.
Figure 2. Year-wise distribution of sequences in Genbank.
Security is more important when the user holds their health records and also when this data has to be transmitted between doctors or researchers. Security problems can get resolved using blockchain technology, which has already proven itself in the field of transactions and now gaining importance in healthcare systems. In simple terms, we can say that a blockchain is a chain of encrypted records (known as blocks) that is secured using the traditional SHA 256. DNA data collected from labs and hospitals are securely stored using blockchain so that they can’t get changed or attacked by an intruder. The researchers who have authorized access can use this data for classification purposes.
TopBackground
In this paper section, First part highlights the DNA data composition and its file format. In second and third part a brief introduction to blockchain technology and DNA classification has been given. In section 3 need for blockchain in healthcare is shown. Section 4 gives a detailed literature review on work done in the field of blockchain in healthcare as well as in the field of DNA sequence classification. Finally, section 5 concludes the paper.
DNA Data Composition
The study of DNA sequence is crucial for comprehending organisms since it is well-known that biological information about an organism or human is stored in its DNA. Figure 3 depicts the DNA's fundamental structure.
Figure 3. The basic structure of DNA.
FASTA is the textual representation of DNA sequence (Mohammed et al. 2012 and Aste et al. 2017), that uses only single letters for representing the DNA base pairs as depicted in table 1.
Table 1. A | Adenosine |
C | Cytoline |
G | Guanine |
T | Thymidine |
N | Can be any of these (A,C,G,T) |
The order in which nucleotides are arranged represents the layout of any organism. In FASTA representation the line starts with the description of the DNA sequence i.e. identifier followed by the sequence data as shown in the below example.
>U02993.1 Human cytochrome P450 (Cyp1A2) gene, 5' region
GGTACCTTGAGAAAGGAACACAACAGGGACTTCTTGGATGCTTATGATGTCCTTGATTAGAGCTGGTTA
For separating DNA sequence and identifier a special symbol “>” is used. The part of a word that follows the symbol “>” denotes the sequence's identification, and the remaining text, which is optional and isolates the identifier from a white space or tab, denotes the sequence's description. After the text line, the real DNA sequence will begin on the line that follows. The second line starts with another “>” sign, denoting the end of the sequence and the start of a new sequence.