Data engineering
General information
Course code: | 1000-2M23DE |
Erasmus / ISCED code: | 11.3 |
Course name: | Data engineering |
Unit: | Faculty of Mathematics, Informatics and Mechanics |
Groups: |
4EU+ courses (from the offer of teaching units); Elective courses for Computer Science; Elective courses for Machine Learning |
ECTS credits and other: | 6.00 |
Language of instruction: | English |
Type of course: | monographic |
Short description: |
Overview of the data processing pipeline; collection and storage of raw data; processing, cleaning, and storage of processed data; scaling tools for the data processing system. |
Full description: |
The course starts from the basics of the data engineering task and shows how such a system differs from something like a personal blog or an e-commerce platform. It briefly defines the areas where data engineering approaches make sense. It then gives an overview of file formats and why they matter, and decomposes the general idea of a database into its parts. It describes ways to store data in the system, demonstrates how to implement processing tasks in the context of a data pipeline, and presents tooling for orchestrating independent tasks into a single pipeline. In addition, it covers tools such as queues, which connect the elements of a data engineering system with each other and with elements outside it.
1. Introduction, MAD, MDS, Data Engineering life cycle, sources of information and self-education
2. Evolution of Data Engineering, Lambda architecture, Kappa architecture, cloud native, storage and compute separation
3. Source systems
4. Data modelling, transformation, DAG, Spark
5. Data warehouse, data lake, lakehouse
6. Data governance, Data Hub
7. Streams vs queues, Spark, Pulsar
8. Decomposition, orchestration, Prefect
9. Consumers, Superset
10. Quality, security, observability
11. Data Engineering architecture and the people we work with
12. Project demo
13. Summary |
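As a rough illustration of the DAG idea (topic 4) and of orchestrating independent tasks into a single pipeline (topic 8), here is a minimal sketch. It uses Python's standard-library `graphlib` rather than Prefect, which is the tool named in the course. The task names (`collect`, `clean`, `store`), the sample records, and the `run_pipeline` helper are all hypothetical and are not taken from the course materials.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def collect():
    # Hypothetical raw records, as they might arrive from a source system.
    return [" 42 ", "7", "", "n/a", "13 "]


def clean(raw):
    # Strip whitespace and keep only the values that are plain digits.
    stripped = (record.strip() for record in raw)
    return [int(value) for value in stripped if value.isdigit()]


def store(cleaned):
    # Stand-in for writing to storage: return a small summary instead.
    return {"count": len(cleaned), "total": sum(cleaned)}


# Each task maps to (function, list of upstream tasks it depends on).
TASKS = {
    "collect": (collect, []),
    "clean": (clean, ["collect"]),
    "store": (store, ["clean"]),
}


def run_pipeline(tasks):
    # Build a {task: predecessors} graph and run tasks in dependency order.
    graph = {name: set(deps) for name, (_, deps) in tasks.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = tasks[name]
        results[name] = func(*(results[dep] for dep in deps))
    return results


results = run_pipeline(TASKS)
print(results["store"])
```

A real orchestrator adds scheduling, retries, and observability on top of this ordering step. The sketch only shows how a dependency graph determines the order in which tasks run.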
Literature: |
1. Designing Data-Intensive Applications. A must-read (even reread) book.
2. Data Engineering with Apache Spark, Delta Lake, and Lakehouse: Create scalable pipelines that ingest, curate, and aggregate complex data in a timely and secure way. A book for gaining structured knowledge of the Databricks family of tools.
3. Kafka in Action. For those who don't want to read the docs. It will be out of date in 1-2 years, but for now it is good for building intuition.
4. The Log: What every software engineer should know about real-time data's unifying abstraction. A must-read article (yes, it is fine that it is from 2013), on a blog that is good to read in general: https://engineering.linkedin.com/blog/topic/distributed-systems.
5. How to beat the CAP theorem.
6. Questioning the Lambda Architecture.
7. How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh.
8. Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics.
9. Towards Data Science. A source of news, good for beginners.
10. https://medium.com/the-prefect-blog. Many articles that are good for beginners (e.g. https://medium.com/the-prefect-blog/are-you-an-accidental-data-engineer-6b60e0f51286; you can skip everything related to Prefect directly). |
Learning outcomes: |
Understanding the basic principles of most data processing tasks and the mechanics of modern tools |
Assessment methods and criteria: |
- Lab projects.
- If an LMS is used: topic assessments, including peer-review assessments.
- Final project. |
Classes in period "Winter semester 2023/24" (completed)
Period: | 2023-10-01 - 2024-01-28 |
Type of class: | Laboratory, 30 hours; Lecture, 30 hours |
Coordinators: | Yura Braiko |
Group instructors: | Yura Braiko |
Assessment: | Exam |
Notes: |
Classes are conducted in English and remotely. |
Classes in period "Summer semester 2024/25" (not yet started)
Period: | 2025-02-17 - 2025-06-08 |
Type of class: | Laboratory, 30 hours; Lecture, 30 hours |
Coordinators: | Yura Braiko |
Group instructors: | Yura Braiko |
Assessment: | Exam |
Mode of instruction: | remote |
Copyright is held by the University of Warsaw, Faculty of Chemistry.