
I have a very large set of strings from which I want to find the subset of unique strings, and I am using the set container. The methods go out to a MySQL database, pull in a new group of strings, and try to add them to a set. I check the return value from insert to determine whether the string was added (first occurrence) or was already present.

    #include <iostream>
    #include <string>
    #include <fstream>
    #include <algorithm>
    #include <vector>
    #include <iostream>
    #include "CDR3Sample.h"
    #include "MySQLConnect.h"

    using namespace std;

    int main() {
        CDR3SetReturn ret;
        //CDR3Set is a typedef on set<string>
        CDR3Set total;
        try {
            MySQLConnect connection;
            cerr << "size of master " << connection.getMasterSize() << endl;
            SampleIDList list = connection.getSampleIDList();
            SampleIDList ids_seen;
            cerr << "size of raw ID list " << list.size() << endl;
            for (SampleIDListIterator it=list.begin(); it != list.end(); it++) {
                // We're going to skip it if the table doesn't exist or if the sample has already been processed
                if (connection.checkTable(*it) && find(ids_seen.begin(), ids_seen.end(), *it)!=list.end()) {
                    CDR3Sample s(*it, connection);
                    int valid_number = 0;
                    for (CDR3SetIterator sit=s.begin(); sit != s.end(); sit++) {
                        ret = total.insert(*sit);
                        if (ret.second) {
                            valid_number++;
                        }
                    }
                    cout << *it << " " << s.getLength() << " " << valid_number << " " << total.size() << endl;
                    ids_seen.push_back(*it);
                } else {
                    cerr << *it << " table not found" << endl;
                }
            }
        } catch (int i) {
            // Need to put code here to save state of calculation
            std::cerr << "Exception thrown by MySQLConnect " << i << std::endl;
            exit(-1);
        }
        // Need to put code here to save state of calculation
        cerr << "size of total " << total.size() << endl;
        ofstream ofs ("cdr3_tally.test", ifstream::out);
        int it_count=0;
        while (ofs.good()) {
            for (CDR3SetIterator it=total.begin(); it != total.end(); ++it) {
                cout << it_count << " " << *it << endl;
                it_count++;
            }
        }
        ofs.close();
        cerr << "it_count " << it_count << endl;
        ofs_naive.close();
        return 0;
    }

I'll leave the supporting code out for brevity, but I can provide it.

When it gets to the end, it has the correct number of entries:

    size of master 9243
    size of raw ID list 1
    ~MySQLConnect
    size of total 372

But the loop that writes out the set just keeps going and going for millions of lines. If I use sort -u on the output, it has the correct number of entries.

I am stumped. The code looks OK to me. It's not that complicated.

Can anyone see something that I have done wrong? Should I make a formal class out of CDR3Set instead of a typedef?

I am using g++ on Ubuntu:

    $ g++ -v
    Using built-in specs.
    COLLECT_GCC=g++
    COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.8/lto-wrapper
    Target: x86_64-linux-gnu
    Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.8.1-2ubuntu1~12.04' --with-bugurl=file:///usr/share/doc/gcc-4.8/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.8 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.8 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.8-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
    Thread model: posix
    gcc version 4.8.1 (Ubuntu 4.8.1-2ubuntu1~12.04)

Thanks

Mike

  • Not sure if this has anything to do with it, but CDR3SetIterator it=total.begin(); is bad practice; you should make the iterator outside of the loops. Commented Aug 12, 2014 at 20:34
  • I think you must change while (ofs.good()) to if (ofs.good()). Commented Aug 12, 2014 at 20:45
  • I just changed the output to Commented Aug 12, 2014 at 20:50
  • Changing 'while' to 'if' and taking the iterator definition out of the loop solved the problem. Thanks. I also found a construct that worked: std::ostream_iterator< double > output( cout, " " ); cout << "doubleSet contains: "; std::copy( doubleSet.begin(), doubleSet.end(), output ); Commented Aug 12, 2014 at 20:54
  • @Camron_Godbout Why is that (iterators should not be created in the for header)? Commented Aug 12, 2014 at 21:02

1 Answer


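The for loop that prints to cout is enclosed in while (ofs.good()). Nothing inside the for loop ever writes to ofs, so nothing can put the stream into a bad state. The while condition therefore stays true, and the program keeps looping over the set and printing everything again and again.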
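
A minimal sketch of one way to restructure the output step: open the file, check it once, and iterate over the set a single time, writing to the file stream rather than to cout. The function name writeSet is hypothetical, and the typedef assumes CDR3Set is std::set<std::string>, as stated in the question.

```cpp
#include <cstddef>
#include <fstream>
#include <set>
#include <string>

// Assumption: CDR3Set is a typedef for std::set<std::string>, as in the question.
typedef std::set<std::string> CDR3Set;

// Hypothetical helper: writes each element once, prefixed by its index.
// Returns the number of lines written (0 if the file could not be opened).
std::size_t writeSet(const CDR3Set& total, const std::string& path) {
    std::ofstream ofs(path.c_str());
    if (!ofs) {                       // checked once, not used as a loop condition
        return 0;
    }
    std::size_t count = 0;
    for (CDR3Set::const_iterator it = total.begin(); it != total.end(); ++it) {
        ofs << count << " " << *it << "\n";
        if (!ofs) {                   // stop if a write actually failed
            break;
        }
        ++count;
    }
    return count;
}
```

With this shape, the stream state is only consulted around real writes to ofs, so the loop ends when the set is exhausted.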


1 Comment

Thanks. Sorry I didn't catch that.
