Error when creating StructType from JSON Schema in PySpark

noobie

The requirement is to load records from a given CSV file and, for each record, construct a JSON message that will be sent to a third-party API endpoint via HTTP POST.

The schema is actually quite large (~20 fields in the items property array), so I have shortened it for this post.

I have the CSV, the JSON schema and a sample JSON file. I welcome any ideas on a simpler approach, but here is what I am trying to do in PySpark:

  1. Load CSV into a DataFrame.
  2. Construct a StructType from the JSON schema.
  3. Create an empty DataFrame from the StructType.
  4. Populate the fields of the new structured DataFrame (the empty DataFrame built from the JSON schema) from the DataFrame loaded from the CSV.
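
As one possible simpler approach, the record-to-message step can be sketched without Spark at all, using only the standard library. This is a minimal sketch under loud assumptions: the CSV column names below are hypothetical (the post does not list them), and it assumes every column that is not a top-level schema property becomes one FieldCode/Value entry in Details.

```python
import csv
import io
import json

# Hypothetical CSV content; the real column names are not given in the post.
csv_text = (
    "PersonID,CreatedOnUtc,Country,FirstName,LastName\n"
    "1,2020-01-01T00:00:00Z,ZA,Ann,Lee\n"
)

# Top-level properties taken from the JSON schema in the post.
TOP_LEVEL = ("PersonID", "CreatedOnUtc", "Country")


def build_payload(row):
    """Turn one CSV row (a dict) into a message shaped like the schema."""
    payload = {"PersonID": row["PersonID"], "CreatedOnUtc": row["CreatedOnUtc"]}
    # Assumption: every remaining column maps to one FieldCode/Value pair.
    payload["Details"] = [
        {"FieldCode": column, "Value": value}
        for column, value in row.items()
        if column not in TOP_LEVEL
    ]
    payload["Country"] = row["Country"]
    return payload


reader = csv.DictReader(io.StringIO(csv_text))
messages = [json.dumps(build_payload(row)) for row in reader]
print(messages[0])
```

Each string in messages could then be used as the body of one HTTP POST request. Whether this is preferable to Spark depends on the size of the CSV file.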

The issue I have at the moment is that when I try to execute the following statement:

payload_schema = StructType.fromJson(json.loads(json_schema))


I get this error:

KeyError: 'fields'
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<command-3036218659251259> in <module>()
      3
      4 # Restore schema from json:
----> 5 payload_schema = StructType.fromJson(json.loads(json_schema))
      6
      7

/databricks/spark/python/pyspark/sql/types.py in fromJson(cls, json)
    575     @classmethod
    576     def fromJson(cls, json):
--> 577         return StructType([StructField.fromJson(f) for f in json["fields"]])
    578
    579     def fieldNames(self):

KeyError: 'fields'
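
The traceback shows fromJson reading json["fields"]. That key belongs to Spark's own schema serialization (the format produced by df.schema.json()), not to a JSON Schema document, which uses "properties" instead. The sketch below compares the top-level keys of the two formats in plain Python; the two sample documents are shortened illustrations, not the full schema.

```python
import json

# Spark's own schema serialization: a "struct" with a list of "fields".
spark_schema_doc = json.loads("""
{
  "type": "struct",
  "fields": [
    {"name": "PersonID", "type": "string", "nullable": true, "metadata": {}},
    {"name": "Country", "type": "string", "nullable": true, "metadata": {}}
  ]
}
""")

# A JSON Schema (draft-04) document: an "object" with "properties".
json_schema_doc = json.loads("""
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "PersonID": {"type": "string"},
    "Country": {"type": "string"}
  }
}
""")

spark_has_fields = "fields" in spark_schema_doc
json_schema_has_fields = "fields" in json_schema_doc

print(sorted(spark_schema_doc))   # top-level keys Spark expects
print(sorted(json_schema_doc))    # top-level keys JSON Schema provides
print(spark_has_fields, json_schema_has_fields)
```

Since the JSON Schema document has no top-level "fields" key, the lookup inside fromJson raises KeyError: 'fields'.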


Here is the schema that is assigned to the 'json_schema' variable. I used JSON.stringify to collapse it into a single-line string so that it can be assigned to a variable:

json_schema = '{"$schema":"http://json-schema.org/draft-04/schema#","type":"object","properties":{"PersonID":{"type":"string"},"CreatedOnUtc":{"type":"string"},"Details":{"type":"array","items":[{"type":"object","properties":{"FieldCode":{"type":"string"},"Value":{"type":"string"}},"required":["FieldCode","Value"]},{"type":"object","properties":{"FieldCode":{"type":"string"},"Value":{"type":"string"}},"required":["FieldCode","Value"]}]},"Country":{"type":"string"}},"required":["PersonID","CreatedOnUtc","Details","Country"]}'


Here is the same schema in its original, formatted form:

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "PersonID": {
      "type": "string"
    },
    "CreatedOnUtc": {
      "type": "string"
    },
    "Details": {
      "type": "array",
      "items": [
        {
          "type": "object",
          "properties": {
            "FieldCode": {
              "type": "string"
            },
            "Value": {
              "type": "string"
            }
          },
          "required": [
            "FieldCode",
            "Value"
          ]
        },
        {
          "type": "object",
          "properties": {
            "FieldCode": {
              "type": "string"
            },
            "Value": {
              "type": "string"
            }
          },
          "required": [
            "FieldCode",
            "Value"
          ]
        }
      ]
    },
    "Country": {
      "type": "string"
    }
  },
  "required": [
    "PersonID",
    "CreatedOnUtc",
    "Details",
    "Country"
  ]
}
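
One way to bridge the two formats is to translate the JSON Schema into the dictionary layout that StructType.fromJson reads. The following is a minimal sketch, not a complete JSON Schema implementation. The helper name to_spark_type and the type mapping are assumptions introduced here, and when "items" is a list (as in this schema) it assumes all entries share the shape of the first one.

```python
import json

# Assumed mapping from JSON Schema primitive types to Spark type names.
PRIMITIVES = {
    "string": "string",
    "integer": "long",
    "number": "double",
    "boolean": "boolean",
}


def to_spark_type(node):
    """Translate a JSON Schema node into Spark's schema JSON layout."""
    kind = node.get("type")
    if kind == "object":
        required = set(node.get("required", []))
        return {
            "type": "struct",
            "fields": [
                {
                    "name": name,
                    "type": to_spark_type(sub),
                    "nullable": name not in required,
                    "metadata": {},
                }
                for name, sub in node.get("properties", {}).items()
            ],
        }
    if kind == "array":
        items = node.get("items", {"type": "string"})
        if isinstance(items, list):
            # Assumption: every entry of a list-valued "items" has the same shape.
            items = items[0] if items else {"type": "string"}
        return {
            "type": "array",
            "elementType": to_spark_type(items),
            "containsNull": True,
        }
    return PRIMITIVES[kind]  # raises KeyError for unsupported types


# Shortened version of the schema from the post.
json_schema = json.dumps({
    "type": "object",
    "properties": {
        "PersonID": {"type": "string"},
        "Details": {
            "type": "array",
            "items": [{
                "type": "object",
                "properties": {
                    "FieldCode": {"type": "string"},
                    "Value": {"type": "string"},
                },
                "required": ["FieldCode", "Value"],
            }],
        },
    },
    "required": ["PersonID", "Details"],
})

spark_schema = to_spark_type(json.loads(json_schema))
print(json.dumps(spark_schema, indent=2))

# With pyspark available, the result should be usable as:
# payload_schema = StructType.fromJson(spark_schema)
```

The output has a top-level "fields" list, which is the key the failing call was looking for. Keywords such as "oneOf", "$ref" or "enum" are not handled by this sketch.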


Any advice or ideas would be greatly appreciated. Thanks in advance.
