
Conversation

@medullaskyline commented May 5, 2016

@jreback (Contributor) commented May 5, 2016

you need tests for this and an example of why you think this is broken

@codecov-io commented May 5, 2016

Current coverage is 84.14%

Merging #13087 into master will decrease coverage by <.01%

```
@@            master   #13087    diff @@
==========================================
  Files          138      137      -1
  Lines        50724    50289    -435
  Methods          0        0
  Messages         0        0
  Branches         0        0
==========================================
- Hits         42726    42312    -414
+ Misses        7998     7977     -21
  Partials         0        0
```

Powered by Codecov. Last updated by 103f7d3...4a30557

@medullaskyline (Author)

I hope this comment provides a clearer example.

@jreback (Contributor) commented May 6, 2016

no please show a run of this and the error you get with master

@medullaskyline (Author) commented May 6, 2016

Thanks for your patience! When I run this...

```python
df_to_append = pd.DataFrame({
    'date': {pd.Timestamp('2016-05-04 00:00:00', offset='D'): pd.Timestamp('2016-05-04 00:00:00')},
    'new_nih_users': {pd.Timestamp('2016-05-04 00:00:00', offset='D'): 0.0},
    'new_web_users': {pd.Timestamp('2016-05-04 00:00:00', offset='D'): 6.0},
    'returning_nih_users': {pd.Timestamp('2016-05-04 00:00:00', offset='D'): 2.0},
    'returning_web_users': {pd.Timestamp('2016-05-04 00:00:00', offset='D'): 5.0}})
df_to_append.to_gbq(dataset + '.' + table, project, if_exists='append')
```

... I get an InvalidSchema error. Traceback:

```
---------------------------------------------------------------------------
InvalidSchema                             Traceback (most recent call last)
<ipython-input-57-704709b67558> in <module>()
      8     'returning_nih_users': {pd.Timestamp('2016-05-04 00:00:00', offset='D'): 2.0},
      9     'returning_web_users': {pd.Timestamp('2016-05-04 00:00:00', offset='D'): 5.0}})
---> 10 df_to_append.to_gbq(dataset + '.' + table, project, if_exists='append')

/Library/Python/2.7/site-packages/pandas/core/frame.pyc in to_gbq(self, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key)
    896         return gbq.to_gbq(self, destination_table, project_id=project_id,
    897                           chunksize=chunksize, verbose=verbose, reauth=reauth,
--> 898                           if_exists=if_exists, private_key=private_key)
    899
    900     @classmethod

/Library/Python/2.7/site-packages/pandas/io/gbq.pyc in to_gbq(dataframe, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key)
    715     elif if_exists == 'append':
    716         if not connector.verify_schema(dataset_id, table_id, table_schema):
--> 717             raise InvalidSchema("Please verify that the column order, "
    718                                 "structure and data types in the "
    719                                 "DataFrame match the schema of the "

InvalidSchema: Please verify that the column order, structure and data types in the DataFrame match the schema of the destination table.
```

The generated schema of df_to_append is

```python
{'fields': [{'name': 'date', 'type': 'TIMESTAMP'},
            {'name': 'new_nih_users', 'type': 'FLOAT'},
            {'name': 'new_web_users', 'type': 'FLOAT'},
            {'name': 'returning_nih_users', 'type': 'FLOAT'},
            {'name': 'returning_web_users', 'type': 'FLOAT'}]}
```

and the schema of the BigQuery table I want to append to is

```python
{u'fields': [{u'mode': u'NULLABLE', u'name': u'date', u'type': u'TIMESTAMP'},
             {u'mode': u'NULLABLE', u'name': u'new_web_users', u'type': u'FLOAT'},
             {u'mode': u'NULLABLE', u'name': u'returning_web_users', u'type': u'FLOAT'},
             {u'mode': u'NULLABLE', u'name': u'new_nih_users', u'type': u'FLOAT'},
             {u'mode': u'NULLABLE', u'name': u'returning_nih_users', u'type': u'FLOAT'}]}
```

Therefore, the generated schema of the DataFrame does not equal the schema of the BigQuery table I want to append the DataFrame to.
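To make the mismatch concrete, here is a minimal, standalone sketch of the comparison that fails (the field dicts are abbreviated illustrations, not the real schemas): because the API-returned fields carry an extra `mode` key, plain dict equality can never hold even when names and types agree.

```python
# Schema as generated from the DataFrame: only 'name' and 'type' keys.
generated = {'fields': [{'name': 'date', 'type': 'TIMESTAMP'}]}

# Schema as returned by the BigQuery API: same field, but with an extra
# 'mode' key (and often 'description'), so a straight equality check fails.
remote = {'fields': [{'mode': 'NULLABLE', 'name': 'date', 'type': 'TIMESTAMP'}]}

print(generated == remote)  # False: the extra 'mode' key breaks equality
```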

@jreback (Contributor) commented May 6, 2016

how is that a bug? looks like user error to me
you have to match the schema of the existing table
how did you create the original table?

@medullaskyline (Author) commented May 6, 2016

how is that a bug? looks like user error to me

The reason I think it is a bug is that the generated schema for DataFrames comes from this function, which only adds name and type keys for each field.

you have to match the schema of the existing table

Yes, but currently the verify_schema method reads the schema of the BigQuery table using an apiclient.discovery.Resource object called service, like this:

```python
self.service.tables().get(
    projectId=self.project_id,
    datasetId=dataset_id,
    tableId=table_id
).execute()['schema']
```

which will yield a schema just like the generated schema from my dataframe but with the extra keys mode and description for each field. Documentation for Google's BigQuery python API here shows that the schema returned has description and mode in addition to name and type.

how did you create the original table?

From an IPython notebook -- generating it using the dictionary I provided above gives the same result.

@medullaskyline (Author)

The documentation says

The dataframe must match the destination table in column order, structure, and data types.

but it does not specify that the dataframe's columns must have the same "description" or "mode" as the BigQuery table's columns. It is possible to generate "name" and "type" keys for each column in a dataframe but not "description" or "mode".

That's why I think verify_schema should get the destination table's schema using the BigQuery service API and then remove "description" and "mode" from that BigQuery table schema.
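That proposal can be sketched as follows (the helper name is hypothetical, not the actual pandas code): strip the optional keys from the API-returned schema with a list comprehension before comparing.

```python
def strip_optional_keys(schema):
    # Hypothetical helper: keep only 'name' and 'type' for each field,
    # dropping optional keys such as 'mode' and 'description' so the
    # result is comparable to the schema generated from a DataFrame.
    return {'fields': [{'name': field['name'], 'type': field['type']}
                       for field in schema['fields']]}

remote_schema = {'fields': [
    {'mode': 'NULLABLE', 'name': 'date', 'type': 'TIMESTAMP',
     'description': 'event date'},
]}

print(strip_optional_keys(remote_schema))
# {'fields': [{'name': 'date', 'type': 'TIMESTAMP'}]}
```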

@parthea (Contributor) commented May 6, 2016

Thanks for the clear explanation of the issue. I can understand how the schema verification could fail. I see that mode and description are optional fields according to the BigQuery docs. I agree with ignoring the description field in the schema verification.

I don't think the mode field should be ignored, especially when it is set to REPEATED. REPEATED fields can be thought of as arrays of values. See here for more information about the REPEATED option. The default value for mode is NULLABLE. Perhaps we should support only the NULLABLE and REQUIRED options?

(screenshot: BigQuery documentation of schema field options, 2016-05-05)

@medullaskyline (Author) commented May 6, 2016

Looks like I forgot to include fields as another field key to be ignored. My mistake.

I agree that the mode field has important information.

Instead of stripping description, mode and fields from the verify_bq_schema method, do you think it would be better to include mode and fields in _generate_bq_schema (in addition to name and type)?

I see that generate_bq_schema will be removed in a future version, so I don't know what the best option is.

@parthea (Contributor) commented May 6, 2016

Based on the BigQuery docs and my understanding of the implementation, pandas currently doesn't support a BigQuery schema with any of the following options:

  • mode is set to REPEATED (field is an array of values)
  • mode is set to REQUIRED (pandas doesn't check for nulls, so exception would be raised by BigQuery API if null is found)
  • type is set to RECORD (nested schema defined in fields)

Based on the above, I would prefer the following:

  • Update the pandas documentation (or schema mismatch error message) to indicate that only mode with default value of NULLABLE is supported.
  • Ignore the description field when verifying schema
  • In generate_bq_schema, set mode to the default which is NULLABLE
  • Any other schema differences should be raised to the user until we can add support in pandas for the more complicated options.
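A standalone sketch of the behavior proposed above (function names are hypothetical, not the actual pandas implementation): generate fields with the default mode NULLABLE, ignore only description when verifying, and let any other difference, including mode, surface as a mismatch.

```python
def generate_bq_schema_sketch(columns):
    # Hypothetical: every generated field gets BigQuery's default mode.
    return {'fields': [{'name': name, 'type': bq_type, 'mode': 'NULLABLE'}
                       for name, bq_type in columns]}

def schemas_match_sketch(local, remote):
    # Hypothetical: ignore only 'description'; every other key (including
    # 'mode') must match, so REPEATED/REQUIRED/RECORD tables are rejected.
    def clean(fields):
        return [{k: v for k, v in f.items() if k != 'description'}
                for f in fields]
    return clean(local['fields']) == clean(remote['fields'])

local = generate_bq_schema_sketch([('date', 'TIMESTAMP')])
remote = {'fields': [{'name': 'date', 'type': 'TIMESTAMP',
                      'mode': 'NULLABLE', 'description': 'event date'}]}
print(schemas_match_sketch(local, remote))  # True
```

A REQUIRED or REPEATED mode on the remote table would make this sketch return False, which matches the "raise to the user" behavior in the last bullet.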
@jreback jreback added this to the 0.18.2 milestone May 6, 2016
@jreback jreback added API Design Compat pandas objects compatability with Numpy or Python functions and removed API Design labels May 6, 2016
@jreback jreback removed this from the 0.18.2 milestone May 6, 2016
@jreback (Contributor) commented May 6, 2016

if some of the other contributors can comment.

cc @jacobschaer
cc @sean-schaefer
cc @aaront

this would need some tests of where it should succeed (append is ok given various states of the existing table), and fail (where it should raise)

Review comment (Contributor):

give a much simpler message that a user reading will grok immediately.

@medullaskyline (Author) commented May 6, 2016

@parthea

  • Update the pandas documentation (or schema mismatch error message) to indicate that only mode with default value of NULLABLE is supported.
    I updated both the documentation and the schema mismatch error message.
  • Ignore the description field when verifying schema
    I took out the description field and used a list comprehension (code here) as @jreback asked.
  • In generate_bq_schema, set mode to the default which is NULLABLE
    Code here and I also updated one of the tests to include the mode field.
  • Any other schema differences should be raised to the user until we can add support in pandas for the more complicated options. (No changes necessary.)

@jreback

I also added a couple more tests to GBQUnitTests using the bigquery-public-data project id. They're meant to test the changes to generate_bq_schema. They are probably redundant since test_generate_schema already exists.

Review comment on pandas/io/gbq.py (Outdated), from a Contributor:

Can we remove default_mode as a parameter for now since we will only support the 'NULLABLE' option?

@parthea (Contributor) commented May 8, 2016

Done reviewing. A few minor comments.

Review comment on pandas/io/gbq.py (Outdated), from @parthea, May 8, 2016:

Could we add a unit test to confirm that 'description' is ignored when comparing the schema of an existing table?

Reply from @medullaskyline (Author):
Sure! You could argue that test_df_with_matching_schema_should_verify_schema confirms that 'description' is ignored, but I see your point.

So I added test_verify_schema_should_ignore_column_description. The idea was to first prove that 'description' was the one and only key in the BigQuery table's schema that doesn't exist in the schema generated from the DataFrame. Then if the generated schema is verified with verify_schema, that proves description is in fact ignored. Feedback welcome.

There is one test that fails but that's because of this commit in testing.assert_index_equal.
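The two-step test idea described above can be sketched like this (the field dicts are illustrative, not the real fixtures):

```python
# Step 1: prove that 'description' is the one and only key in the remote
# field that the generated field lacks.
remote_field = {'name': 'date', 'type': 'TIMESTAMP',
                'mode': 'NULLABLE', 'description': 'event date'}
generated_field = {'name': 'date', 'type': 'TIMESTAMP', 'mode': 'NULLABLE'}
extra_keys = set(remote_field) - set(generated_field)
assert extra_keys == {'description'}

# Step 2: with 'description' removed, the fields compare equal, so a
# verification step that ignores 'description' must accept this pair.
stripped = {k: v for k, v in remote_field.items() if k != 'description'}
assert stripped == generated_field
```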

@parthea (Contributor) commented May 10, 2016

All gbq tests pass locally

```
tony@tonypc:~/pandas-fix-to_gbq/pandas/io/tests$ nosetests test_gbq.py -v
test_append_to_table_w_required_cols_should_raise_InvalidSchema (pandas.io.tests.test_gbq.GBQUnitTests) ... ok
test_bq_table_with_required_fields_should_fail_verify_schema (pandas.io.tests.test_gbq.GBQUnitTests) ... ok
test_df_with_matching_schema_should_verify_schema (pandas.io.tests.test_gbq.GBQUnitTests) ... ok
test_read_gbq_with_corrupted_private_key_json_should_fail (pandas.io.tests.test_gbq.GBQUnitTests) ... ok
test_read_gbq_with_empty_private_key_file_should_fail (pandas.io.tests.test_gbq.GBQUnitTests) ... ok
test_read_gbq_with_empty_private_key_json_should_fail (pandas.io.tests.test_gbq.GBQUnitTests) ... ok
test_read_gbq_with_invalid_private_key_json_should_fail (pandas.io.tests.test_gbq.GBQUnitTests) ... ok
test_read_gbq_with_no_project_id_given_should_fail (pandas.io.tests.test_gbq.GBQUnitTests) ... ok
test_read_gbq_with_private_key_json_wrong_types_should_fail (pandas.io.tests.test_gbq.GBQUnitTests) ... ok
test_should_return_bigquery_booleans_as_python_booleans (pandas.io.tests.test_gbq.GBQUnitTests) ... ok
test_should_return_bigquery_floats_as_python_floats (pandas.io.tests.test_gbq.GBQUnitTests) ... ok
test_should_return_bigquery_integers_as_python_floats (pandas.io.tests.test_gbq.GBQUnitTests) ... ok
test_should_return_bigquery_strings_as_python_strings (pandas.io.tests.test_gbq.GBQUnitTests) ... ok
test_should_return_bigquery_timestamps_as_numpy_datetime (pandas.io.tests.test_gbq.GBQUnitTests) ... ok
test_that_parse_data_works_properly (pandas.io.tests.test_gbq.GBQUnitTests) ... ok
test_to_gbq_should_fail_if_invalid_table_name_passed (pandas.io.tests.test_gbq.GBQUnitTests) ... ok
test_to_gbq_with_no_project_id_given_should_fail (pandas.io.tests.test_gbq.GBQUnitTests) ... ok
test_verify_schema_should_ignore_column_description (pandas.io.tests.test_gbq.GBQUnitTests) ... ok
test_should_be_able_to_get_a_bigquery_service (pandas.io.tests.test_gbq.TestGBQConnectorIntegration) ... ok
test_should_be_able_to_get_results_from_query (pandas.io.tests.test_gbq.TestGBQConnectorIntegration) ... ok
test_should_be_able_to_get_schema_from_query (pandas.io.tests.test_gbq.TestGBQConnectorIntegration) ... ok
test_should_be_able_to_get_valid_credentials (pandas.io.tests.test_gbq.TestGBQConnectorIntegration) ... ok
test_should_be_able_to_make_a_connector (pandas.io.tests.test_gbq.TestGBQConnectorIntegration) ... ok
test_should_be_able_to_get_a_bigquery_service (pandas.io.tests.test_gbq.TestGBQConnectorServiceAccountKeyContentsIntegration) ... ok
test_should_be_able_to_get_results_from_query (pandas.io.tests.test_gbq.TestGBQConnectorServiceAccountKeyContentsIntegration) ... ok
test_should_be_able_to_get_schema_from_query (pandas.io.tests.test_gbq.TestGBQConnectorServiceAccountKeyContentsIntegration) ... ok
test_should_be_able_to_get_valid_credentials (pandas.io.tests.test_gbq.TestGBQConnectorServiceAccountKeyContentsIntegration) ... ok
test_should_be_able_to_make_a_connector (pandas.io.tests.test_gbq.TestGBQConnectorServiceAccountKeyContentsIntegration) ... ok
test_should_be_able_to_get_a_bigquery_service (pandas.io.tests.test_gbq.TestGBQConnectorServiceAccountKeyPathIntegration) ... ok
test_should_be_able_to_get_results_from_query (pandas.io.tests.test_gbq.TestGBQConnectorServiceAccountKeyPathIntegration) ... ok
test_should_be_able_to_get_schema_from_query (pandas.io.tests.test_gbq.TestGBQConnectorServiceAccountKeyPathIntegration) ... ok
test_should_be_able_to_get_valid_credentials (pandas.io.tests.test_gbq.TestGBQConnectorServiceAccountKeyPathIntegration) ... ok
test_should_be_able_to_make_a_connector (pandas.io.tests.test_gbq.TestGBQConnectorServiceAccountKeyPathIntegration) ... ok
test_bad_project_id (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_bad_table_name (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_column_order (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_column_order_plus_index (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_download_dataset_larger_than_200k_rows (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_index_column (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_malformed_query (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_arbitrary_timestamp (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_empty_strings (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_false_boolean (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_null_boolean (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_null_floats (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_null_integers (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_null_strings (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_null_timestamp (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_timestamp_unix_epoch (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_true_boolean (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_valid_floats (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_valid_integers (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_properly_handle_valid_strings (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_read_as_service_account_with_key_contents (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_should_read_as_service_account_with_key_path (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_unicode_string_conversion_and_normalization (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_zero_rows (pandas.io.tests.test_gbq.TestReadGBQIntegration) ... ok
test_create_dataset (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_create_table (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_dataset_does_not_exist (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_dataset_exists (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_delete_dataset (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_delete_table (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_generate_schema (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_google_upload_errors_should_raise_exception (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_list_dataset (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_list_table (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_list_table_zero_results (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_table_does_not_exist (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_table_with_required_columns_shouldnt_append (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_upload_data (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_upload_data_if_table_exists_append (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_upload_data_if_table_exists_fail (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_upload_data_if_table_exists_replace (pandas.io.tests.test_gbq.TestToGBQIntegration) ... ok
test_upload_data_as_service_account_with_key_contents (pandas.io.tests.test_gbq.TestToGBQIntegrationServiceAccountKeyContents) ... ok
test_upload_data_as_service_account_with_key_path (pandas.io.tests.test_gbq.TestToGBQIntegrationServiceAccountKeyPath) ... ok
pandas.io.tests.test_gbq.test_requirements ... ok
pandas.io.tests.test_gbq.test_generate_bq_schema_deprecated ... ok

----------------------------------------------------------------------
Ran 78 tests in 450.072s

OK
```
Review comment (Contributor):

paragraph break

@jreback jreback added this to the 0.18.2 milestone May 10, 2016
@medullaskyline (Author) commented May 11, 2016

@jreback @parthea Thanks for all your comments! I tried to do a squash with git rebase -i HEAD~20. Not sure what I'm supposed to see changed in the pull request.

Should I write more tests to address the issue with codecov/patch?

@jreback (Contributor) commented May 11, 2016

@medullaskyline you need to push -f to force the update

@jreback (Contributor) commented May 11, 2016

you can squash via git rebase -i origin/master as well

@medullaskyline (Author)

@jreback Just checking to see if there's anything else I need to do.

@jreback (Contributor) commented May 18, 2016

pls squash to a single commit (and rebase on master) / some codecov things changed.

Review comment (Contributor):

make this a bulleted list (of the things that need to match)

@medullaskyline medullaskyline force-pushed the fix-to_gbq branch 2 times, most recently from 73428fe to dabe449 Compare June 2, 2016 23:55
Review comment (Contributor):

you don't need the * on column order (just on the list items)

@jreback (Contributor) commented Jun 3, 2016

needs a whatsnew entry. pls squash.

Author: medullaskyline <medullaskyline@gmail.com> Closes pandas-dev#13086 from medullaskyline
@jreback (Contributor) commented Jun 30, 2016

can you rebase and show a test run

@jreback (Contributor) commented Jul 15, 2016

can you rebase

@jreback (Contributor) commented Jul 20, 2016

@jreback jreback removed this from the 0.19.0 milestone Jul 20, 2016
@jreback (Contributor) commented Sep 9, 2016

can you rebase / update?

we also have new testing for gbq, pls look at the dev docs and see how to.


- Bug in ``Categorical.remove_unused_categories()`` changes ``.codes`` dtype to platform int (:issue:`13261`)

- Bug in ``DataFrame.to_gbq()`` where dataframes could not be appended to Google BigQuery tables that had been created in any way other than ``DataFrame.to_gbq()``.
Review comment (Contributor):

can you move to 0.19.2

@jreback (Contributor) commented Nov 16, 2016

@medullaskyline can you rebase / update

@jreback (Contributor) commented Jan 21, 2017

this was looking good, but needs updating. pls resubmit a rebased version if you can.

@jreback jreback closed this Jan 21, 2017

Labels

Compat pandas objects compatability with Numpy or Python functions

4 participants