Python (PySpark) Developer
- Fully Remote working anywhere in the UK
- Experience in any of the following is required: Python, PySpark, Hadoop, SQL, ETL (the more the better!)
- Inside IR35 (Self-employed contractor position), rates are negotiable
- Initially three months, likely to extend
- Deadline: Tuesday 8th May 2021 at Noon
Description of Requirement:
The COVID19 Infection Survey (CIS) provides statistical analysis on the COVID19 pandemic for Government and Research purposes. The Data Processing Pipeline (DPP) receives data from multiple sources, processes this data in a variety of ways, then provides the cleaned / linked data to several Analysis teams for their investigation. The DPP within CIS is an extremely busy function, working to demanding objectives, with rapid development and very short cycle times.
The role is responsible for the technical development and implementation of the processing pipelines specified and designed in conjunction with the COVID analysis teams. You will be working with a small coding team and a wider analysis team to identify the best ways of ingesting and processing the data received.
Throughout you will drive the delivery of ETL products, whilst supporting team members and others to apply agile and lean principles to deliverables. You will:
* Work with others, through pair programming, to implement data engineering pipelines, from the ETL of the data at ingest through to the creation of the final analytical dataset.
* As far as possible, ensure consistency between pipelines and that any associated standards are followed.
* Ensure that pipelines can be adapted and re-run should data quality issues be identified.
* Liaise with analysis teams to ensure that requirements for outputs are understood and incorporated into the pipeline.
* Address technical blockers, actively seeking solutions to remove them and proposing alternative routes to delivery.
* Routinely test the developed pipelines to ensure that they are fit for purpose.
* Peer review the work of others in the team to ensure quality and consistency.
Relevant Skills and Experience:
* Data engineering - functional programming in Python (including PySpark), multicore/distributed processing (Hadoop), relational and non-relational (Hive/Impala) SQL database systems
* Data engineering - development of ETL routines for slowly changing dimensions
* Testing - writing unit tests for PySpark routines using pytest
* Agile - applying Agile software development practices
* DevOps - applying DevOps practices, including continuous integration and continuous deployment
* Analysis - identifying the root cause of data quality issues and communicating these to data providers