Hopefully some lab will figure out a way to let people collaborate by publishing the raw test data and connecting people, and will start testing for all the major cannabinoids, terpenes, etc. Analytic360 and a couple of others are on the right path. Hopefully they will make the public dataset available as raw XML. Two problems with gas chromatography are that there is no standardized testing protocol between labs, and that environmental factors affect the lab test numbers (date of harvest, outdoor vs indoor, latitude grown, etc.).
Here is a database algorithm to find similar strains/matches so that strains high in THCV and CBG could be "paired" for a breeding project. Kind of bioinformatics :)
SQL Tables:
src
dst
Simplified SQL Fields:
In reality we have 100+ attributes, but to keep the example data simple I am only using three attributes (THCV, Limonene, CBG).
src.SampleName as nvarchar
dst.THCV as number(4,2)
dst.Limonene as number(4,2)
dst.CBG as number(4,2)
Example Records:
In reality we would have tens of thousands of samples, but to keep it simple here are six records with totally bogus numbers.
StrainSampled, THCV, Limonene, CBG
OhNoNotTheATF#1, 01.11%, 2.28%, 04.21%
NorthernLights, 21.12%, 2.21%, 04.21%
BlueDream, 11.12%, 2.21%, 04.21%
FishSticks, 02.10%, 3.11%, 04.21%
Deathstar, 12.11%, 3.21%, 05.21%
WheresMyCarKeys, 10.11%, 3.21%, 05.21%
Task:
We need to find all samples that have a similar "ratio" of attributes (THCV, Limonene, CBG) within a threshold window of three "percentage points" variance. This will give us the plants that need to be bred together (with hopes that the offspring will have even higher numbers for what we are looking for). I used three percentage points as a fudge factor to take into account testing differences between samples (in other words, lab error). This number can be changed as necessary.
OhNoNotTheATF#1 should match with FishSticks because all the attributes (THCV, Limonene, CBG) are "similar", with less than three percentage points differing between OhNoNotTheATF#1 and FishSticks.
BlueDream and Deathstar and WheresMyCarKeys should match because all the attributes are similar with less than three percentage points difference.
NorthernLights won't match anything because its test numbers are radically different.
Here is the resulting set from the database:
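The matching rule described above can be checked by hand with a few lines of Python before touching a database. This is only a rough sketch: the helper names is_similar and breeding_groups are made up for illustration, and it assumes that a "group" means any set of samples linked together by at least one similar pair.

```python
from itertools import combinations

# Bogus example records from this post: (THCV, Limonene, CBG) in percent.
SAMPLES = {
    "OhNoNotTheATF#1": (1.11, 2.28, 4.21),
    "NorthernLights": (21.12, 2.21, 4.21),
    "BlueDream": (11.12, 2.21, 4.21),
    "FishSticks": (2.10, 3.11, 4.21),
    "Deathstar": (12.11, 3.21, 5.21),
    "WheresMyCarKeys": (10.11, 3.21, 5.21),
}


def is_similar(a, b, threshold=3.0):
    """True when every attribute differs by less than `threshold` points."""
    return all(abs(x - y) < threshold for x, y in zip(a, b))


def breeding_groups(samples, threshold=3.0):
    """Group samples linked by at least one similar pair (hypothetical helper)."""
    groups = {name: {name} for name in samples}
    for a, b in combinations(samples, 2):
        if is_similar(samples[a], samples[b], threshold):
            merged = groups[a] | groups[b]
            for name in merged:
                groups[name] = merged
    # Keep each distinct group once and drop samples that matched nothing.
    unique = {frozenset(g) for g in groups.values() if len(g) > 1}
    return sorted(sorted(g) for g in unique)


print(breeding_groups(SAMPLES))
```

With the six bogus records this puts OhNoNotTheATF#1 with FishSticks, puts BlueDream, Deathstar and WheresMyCarKeys together, and leaves NorthernLights out. Comparing every pair is fine for six rows but gets slow with tens of thousands of samples, which is where the database join comes in.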
Group1 for breeding for high ABC
OhNoNotTheATF#1, 01.11%, 2.28%, 04.21%
FishSticks, 02.10%, 3.11%, 04.21%
Group2 for breeding for high XYZ
BlueDream, 11.12%, 2.21%, 04.21%
Deathstar, 12.11%, 3.21%, 05.21%
WheresMyCarKeys, 10.11%, 3.21%, 05.21%
DB Algorithm (Oracle SQL):
(You can change the number 3 to 0.3 or whatever to allow less variance between test results when looking for good partners.)
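-- Assumes src and dst both hold the full sample rows (in effect a self-join).
SELECT src.SampleName, dst.*
FROM src INNER JOIN dst
  ON src.SampleName <> dst.SampleName  -- do not match a sample with itself
  AND ABS(src.THCV - dst.THCV) < 3
  AND ABS(src.Limonene - dst.Limonene) < 3
  AND ABS(src.CBG - dst.CBG) < 3
ORDER BY 1 ASC;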
More Info on ABS (absolute) function:
http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions002.htm
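If you don't have an Oracle instance handy, the same kind of join can be tried with Python's built-in sqlite3 module. This is a minimal sketch, not the production query: it assumes a single hypothetical table called samples joined to itself, uses REAL columns instead of Oracle NUMBER, and the function name find_pairs is made up. The join condition src.SampleName < dst.SampleName is my own choice so that each pair is listed once and no sample matches itself.

```python
import sqlite3

# Bogus example records from this post (percent values).
RECORDS = [
    ("OhNoNotTheATF#1", 1.11, 2.28, 4.21),
    ("NorthernLights", 21.12, 2.21, 4.21),
    ("BlueDream", 11.12, 2.21, 4.21),
    ("FishSticks", 2.10, 3.11, 4.21),
    ("Deathstar", 12.11, 3.21, 5.21),
    ("WheresMyCarKeys", 10.11, 3.21, 5.21),
]


def find_pairs(threshold=3.0):
    """Return (sample, partner) pairs whose attributes all differ by < threshold."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE samples (SampleName TEXT, THCV REAL, Limonene REAL, CBG REAL)"
    )
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?)", RECORDS)
    rows = conn.execute(
        """
        SELECT src.SampleName, dst.SampleName
        FROM samples src
        INNER JOIN samples dst
          ON src.SampleName < dst.SampleName
         AND ABS(src.THCV - dst.THCV) < :t
         AND ABS(src.Limonene - dst.Limonene) < :t
         AND ABS(src.CBG - dst.CBG) < :t
        ORDER BY 1, 2
        """,
        {"t": threshold},
    ).fetchall()
    conn.close()
    return rows


for sample, partner in find_pairs():
    print(sample, "<->", partner)
```

Passing the threshold as a bound parameter makes it easy to tighten the window (for example find_pairs(0.3)) without editing the SQL text.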
Data Sharing:
All gas chromatography machines have the ability to export their data, and there is an open standard format to facilitate dataset sharing and collaboration.
http://tools.proteomecenter.org/wiki...=Formats:mzXML
mzXML is an open data format for storage and exchange of mass spectroscopy data, developed at the SPC/Institute for Systems Biology. mzXML provides a standard container for ms and ms/ms proteomics data and is the foundation of our proteomic pipelines. Raw, proprietary file formats from most vendors can be converted to the open mzXML format.
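To give a feel for what reading such a file might involve, here is a rough Python sketch using only the standard library. Everything about the toy document is an assumption on my part: the element and attribute names (scan, peaks, num, byteOrder and so on) and the peak encoding (base64 text holding big-endian 32-bit floats as m/z, intensity pairs, uncompressed) reflect my understanding of mzXML and should be checked against the actual schema. The helper names encode_peaks, local_name and read_scans are hypothetical.

```python
import base64
import struct
import xml.etree.ElementTree as ET


def encode_peaks(pairs):
    """Pack (m/z, intensity) pairs as big-endian 32-bit floats, then base64."""
    flat = [value for pair in pairs for value in pair]
    return base64.b64encode(struct.pack(f">{len(flat)}f", *flat)).decode("ascii")


# Toy document shaped like an mzXML file (assumed layout, not a real export).
SAMPLE = f"""<mzXML>
  <msRun scanCount="1">
    <scan num="1" msLevel="1" peaksCount="2" retentionTime="PT1.5S">
      <peaks precision="32" byteOrder="network">{encode_peaks([(100.5, 2000.0), (150.25, 500.0)])}</peaks>
    </scan>
  </msRun>
</mzXML>"""


def local_name(tag):
    """Strip any '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_scans(xml_text):
    """Return [(scan number, [(m/z, intensity), ...]), ...] for each scan."""
    scans = []
    for element in ET.fromstring(xml_text).iter():
        if local_name(element.tag) != "scan":
            continue
        peaks = next(child for child in element if local_name(child.tag) == "peaks")
        raw = base64.b64decode(peaks.text.strip())
        values = struct.unpack(f">{len(raw) // 4}f", raw)
        scans.append((int(element.get("num")), list(zip(values[0::2], values[1::2]))))
    return scans


print(read_scans(SAMPLE))
```

The sketch only handles uncompressed 32-bit data. A real export may use 64-bit precision or compression, so the precision and compression attributes would need to be inspected before unpacking.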