Prokaryote: Chlamydophila pneumoniae CWL029

Prokaryote			Protein ID	Description			Non-coding upstream DNA (bp)
-----------------------------------------------------------------------------------------------


Protein Group 1:

Chlamydophila pneumoniae CWL029	15618801	Alanyl tRNA Synthetase		232

Chlamydia trachomatis		15605482	Alanyl tRNA Synthetase		5826

Escherichia coli K12		16130604	alanyl-tRNA synthetase		128

Pseudomonas aeruginosa PAO1	15596100	alanyl-tRNA synthetase		121

Mesorhizobium loti		13470354 	alanyl-tRNA synthetase		260

Haemophilus influenzae Rd	16272755	alanyl-tRNA synthetase 		187


Protein Group 2:

Chlamydophila pneumoniae CWL029	15618077	Leucyl tRNA Synthetase		167

Chlamydia trachomatis		15604929	Leucyl tRNA Synthetase		196

Lactococcus lactis subsp. lactis	15672798	leucyl-tRNA synthetase		70

Bacillus halodurans		15615843	leucyl-tRNA synthetase		519

Bacillus subtilis		16080084 	leucyl-tRNA synthetase		426

Streptococcus pyogenes		15674378 	putative leucyl-tRNA synthetase	174


Protein Group 3:

Chlamydophila pneumoniae CWL029	15618411 	Prolyl tRNA Synthetase		299

Chlamydia trachomatis		15605118 	Prolyl tRNA Synthetase		200

Thermotoga maritima		15643280	prolyl-tRNA synthetase		71

Escherichia coli O157:H7 EDL933	15799876	proline tRNA synthetase		110

Pseudomonas aeruginosa PAO1	15596153	prolyl-tRNA synthetase		107

Escherichia coli K12		16128187 	proline tRNA synthetase		111

Pasteurella multocida		15603235	ProS				242

Procedure for finding each of the above protein groups:

I developed a heuristic for choosing COGs (Clusters of Orthologous Groups) that are likely to contain a high number of closely related (low e-value) proteins, and whose proteins are also likely to have at least 50 bp of non-coding upstream DNA. I used this procedure to find all three protein groups above. The heuristic is as follows:

From the COG list, choose COGs that contain 4 eukaryotes and 40-50 prokaryotes. Narrow this set to the COGs that contain proteins from each type of prokaryotic organism. From those, keep only the COGs that have "synthetase" in the COG description. At this point, I took a random sample of the remaining COGs and kept the 3 whose first 10 proteins had the lowest e-values.
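
The heuristic above can be sketched in Python as a rough illustration. The selection was done by hand from the COG list, so everything in the sketch is an assumption: the Cog record and its field names are hypothetical, the sample size is arbitrary, and "lowest e-values for the first 10 proteins" is interpreted here as the smallest worst-case (maximum) e-value among the first 10 proteins listed.

```python
import random
from dataclasses import dataclass


@dataclass
class Cog:
    # Hypothetical record for one COG; field names are illustrative only.
    cog_id: str
    description: str
    n_eukaryotes: int
    n_prokaryotes: int
    prokaryote_groups: frozenset  # types of prokaryote represented in the COG
    evalues: list                 # e-values of member proteins, in listed order


def passes_filters(cog, all_groups):
    """Apply the three filtering steps of the heuristic to one COG."""
    return (
        cog.n_eukaryotes == 4
        and 40 <= cog.n_prokaryotes <= 50
        and set(all_groups) <= cog.prokaryote_groups  # every type is present
        and "synthetase" in cog.description.lower()
    )


def pick_cogs(cogs, all_groups, sample_size=10, keep=3, seed=0):
    """Filter, take a random sample, and keep the COGs with the lowest e-values."""
    candidates = [c for c in cogs if passes_filters(c, all_groups)]
    rng = random.Random(seed)
    sample = rng.sample(candidates, min(sample_size, len(candidates)))

    # Assumption: score each COG by the largest e-value among its first 10
    # proteins, so a low score means all 10 are closely related.
    def score(cog):
        return max(cog.evalues[:10], default=float("inf"))

    return sorted(sample, key=score)[:keep]
```

Scoring by the maximum rather than the mean is a design choice for this sketch: e-values span many orders of magnitude, so a single distant protein would otherwise be hidden by the average.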