I'd like to quote from Speech and Language Processing: An introduction to natural language processing:
For named entities, the entity rather than the word is the unit of response.
In your case, First Bank of Chicago should count as a single response: all four tokens have to be predicted as one ORG entity (ORG ORG ORG ORG, or B-ORG I-ORG I-ORG I-ORG in BIO notation). Otherwise the whole entity counts as wrong and produces a false positive, a false negative, or both.
If the predicted BIO tags are O B-ORG I-ORG I-ORG, that is a boundary error and the whole entity is wrong. The predicted span Bank of Chicago matches no gold entity, so it is a false positive, and the gold span First Bank of Chicago was not found, so it is a false negative: two demerits.
However, if the predicted tags are O O O O, the entity is simply missed and there is only one demerit: one false negative.
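To make the counting concrete, here is a minimal sketch of exact-match, entity-level scoring. The helper names `extract_entities` and `entity_counts` are my own, not from the book or any library, and I assume a stray I- tag without a matching B- simply starts a new entity.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans encoded by a BIO tag sequence."""
    entities = set()
    start, etype = None, None
    # A trailing "O" sentinel closes an entity that runs to the end of the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if not continues:
            if etype is not None:
                entities.add((start, i, etype))  # end index is exclusive
            start, etype = None, None
            # Assumption: an I- tag with no matching B- opens a new entity.
            if tag.startswith("B-") or tag.startswith("I-"):
                start, etype = i, tag[2:]
    return entities


def entity_counts(gold_tags, pred_tags):
    """Count exact-match true positives, false positives and false negatives."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    tp = len(gold & pred)   # predicted spans that match a gold span exactly
    fp = len(pred - gold)   # predicted spans with no exact gold match
    fn = len(gold - pred)   # gold spans that were not predicted exactly
    return tp, fp, fn


# Tokens: First Bank of Chicago
gold = ["B-ORG", "I-ORG", "I-ORG", "I-ORG"]
boundary_error = ["O", "B-ORG", "I-ORG", "I-ORG"]
nothing = ["O", "O", "O", "O"]

print(entity_counts(gold, gold))            # (1, 0, 0)
print(entity_counts(gold, boundary_error))  # (0, 1, 1)
print(entity_counts(gold, nothing))         # (0, 0, 1)
```

The boundary error costs one false positive plus one false negative, while predicting nothing costs a single false negative.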
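In the article "Doing Named Entity Recognition? Don't optimize for F1", Chris Manning argues that F1 encourages a model to tag everything as O when it is unsure, because boundary errors and label-boundary errors are more costly: each is penalized twice (one false positive plus one false negative), whereas a missed entity is penalized only once.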
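A small worked example of that incentive, under made-up counts that are purely my assumption: suppose a test set has two gold entities and the model gets one exactly right. For the other it either commits a boundary error or outputs nothing. The function name `f1` is my own.

```python
def f1(tp, fp, fn):
    """Entity-level F1 from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: one entity correct, the other a boundary error (fp + fn).
f1_boundary = f1(tp=1, fp=1, fn=1)
# Hypothetical counts: one entity correct, the other not predicted at all (fn only).
f1_nothing = f1(tp=1, fp=0, fn=1)

print(f"boundary error: {f1_boundary:.3f}")  # 0.500
print(f"predict nothing: {f1_nothing:.3f}")  # 0.667
```

With these counts, staying silent scores higher than a near miss, which is the behaviour the article warns about.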
Side note:
an implementation of the entity-level F1 score: https://github.com/jantrienes/nereval