Tendayi Mawushe

This cannot be done reliably, and that is not due to limitations in Python or any other programming language. A human being could not do this predictably either without guessing and following a few rules (usually called heuristics in this context).

So let's first design a few heuristics and then encode them in Python. Things to consider are:

  • All the values are valid strings; that is the premise of the problem, so there is no point in checking for this at all. We should check for everything else we can, and whatever falls through can simply be left as a string.
  • Dates are the most obvious thing to check first. If they are formatted in a predictable manner such as [YYYY]-[MM]-[DD] (the ISO 8601 date format), they are easy to distinguish from other bits of text that contain numbers. If the dates are in a digits-only format like YYYYMMDD then we are stuck, as these dates will be indistinguishable from ordinary numbers.
  • Integers come next, because every integer is also a valid float but not every float is a valid integer. We could just check whether the text contains only digits (or digits and the letters A-F if hexadecimal numbers are possible) and, if so, treat the value as an integer.
  • Floats come next, as they are numbers with some formatting (the decimal point). It is easy to recognise 3.14159265 as a floating point number. However, 5.0 can also be written simply as 5; written that way it is still a valid float, but it will already have been caught by the integer check and will not be recognised as a float even if it was intended to be one.
  • Any values that are left unconverted can be treated as strings.

Because of the overlaps mentioned above, such a scheme can never be 100% reliable. Also, any new data type that you need to support (complex numbers, perhaps) would need its own set of heuristics and would have to be placed in the most appropriate place in the chain of checks. The more likely a check is to match only the desired data type, the higher up the chain it should be.
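To make the overlaps concrete, here is a small sketch that asks Python's built-in `int` and `float` which of them accept a given piece of text. The helper name `interpretations` is made up for this illustration; it is not part of any library.

```python
def interpretations(text):
    """Return every built-in numeric reading that accepts `text`.

    Hypothetical helper, for illustration only.
    """
    readings = {}
    for name, parser in (("int", int), ("float", float)):
        try:
            readings[name] = parser(text)
        except ValueError:
            # This parser rejects the text; try the next one.
            pass
    return readings


print(interpretations("5"))           # accepted by both int and float
print(interpretations("20100120"))    # a YYYYMMDD date also reads as a number
print(interpretations("3.14159265"))  # accepted by float only
print(interpretations("some words"))  # accepted by neither
```

Whenever more than one reading comes back, only the order of the checks decides which type the value ends up with.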

Now let's make this real in Python. Most of the heuristics mentioned above are already taken care of by Python's built-in conversions; we just need to decide on the order in which to apply them:

    from datetime import datetime

    # Ordered from most specific to least specific: ISO 8601 date, int, float.
    heuristics = (
        lambda value: datetime.strptime(value, "%Y-%m-%d"),
        int,
        float,
    )

    def convert(value):
        for heuristic in heuristics:
            try:
                return heuristic(value)
            except ValueError:
                continue
        # All heuristics failed, so it is a string
        return value

    values = ['3.14159265', '2010-01-20', '16', 'some words']

    for value in values:
        converted_value = convert(value)
        print(converted_value, type(converted_value))

Under Python 3 this outputs the following:

    3.14159265 <class 'float'>
    2010-01-20 00:00:00 <class 'datetime.datetime'>
    16 <class 'int'>
    some words <class 'str'>
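As a sketch of how further data types could be slotted into the chain, the variant below adds complex numbers and hexadecimal integers. The names `parse_iso_date`, `parse_hex`, `extended_heuristics` and `convert_extended` are invented for this example, and the ordering is only one reasonable choice under the rule that the stricter a check is, the earlier it runs.

```python
from datetime import datetime


def parse_iso_date(value):
    # Strict ISO 8601 calendar date, e.g. '2010-01-20'.
    return datetime.strptime(value, "%Y-%m-%d")


def parse_hex(value):
    # Base-16 integer, e.g. 'FF'. Very permissive, so it is tried last.
    return int(value, 16)


# Hypothetical ordering: strictest checks first, loosest last.
extended_heuristics = (parse_iso_date, int, float, complex, parse_hex)


def convert_extended(value):
    for heuristic in extended_heuristics:
        try:
            return heuristic(value)
        except ValueError:
            continue
    # Nothing matched, so leave the value as a string.
    return value


for text in ["2010-01-20", "16", "1+2j", "FF", "cafe", "some words"]:
    result = convert_extended(text)
    print(repr(text), "->", repr(result), type(result).__name__)
```

Note that the hexadecimal check is loose enough to accept ordinary words such as 'cafe', which is why it sits at the bottom of the chain in this sketch, and why it may be better left out if such words can appear in the data.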