DuckDuckGoose AI

Senior Software Engineer | ML Data Platform – DuckDuckGoose AI – Delft


Senior Software Engineer – ML Data Platform
Location: Delft, the Netherlands (hybrid)
Type: Full-time
Start: ASAP

The internet has entered an era where reality is generatable. We build the infrastructure that helps institutions distinguish real from synthetic at scale, protecting citizens, enterprises, and governments from synthetic media fraud. Everything you see and hear online can now be manipulated; our job is to make sure people can trust what they see.

As part of our forensics platform team, you'll work on the data backbone that makes large-scale detection possible, from ingestion and versioning to training, evaluation, and production. You'll join a small, senior team where your work will have immediate impact, and you'll have ownership over the systems you build.

You'll work on technically challenging problems such as:
- Tracking model-family clusters across synthetic media types
- Designing reproducible forensic benchmarks at scale
- Managing large-scale image/video datasets with auditable provenance
- Creating deterministic dataset builds for research and production environments

What You'll Drive
- Data platform architecture: Define unified schemas, lineage, and dataset versioning for large image/video and context data.
- Ingestion at scale: Build reliable pipelines from research repos, APIs, and internal generators; automate connectors and jobs.
- Quality & governance: Implement deduplication, validation, health dashboards, and drift/coverage checks with auditable lineage.
- Curation & access: Deliver one-command dataset builds, deterministic splits, and fast sampling tools for training and evaluation.
- Performance & cost: Tune S3/object storage layouts, partitioning, and lifecycle policies for speed and spend.
- Orchestration & ops: Productionize pipelines with CI/CD, containerization, scheduling/monitoring, and safe rollbacks.
- Reliability & operations: Build for simplicity and observability; participate in a planned, compensated support rotation.
- Engineering productivity: Create internal tools, CLIs, docs, and templates that make everyone faster.

Must-haves
- Strong software engineering foundation: Master's in Computer Science, Data Engineering, or a related field.
- Production experience: 5–8+ years building and operating data platforms for large unstructured datasets (images/video).
- Pipelines & orchestration: Experience with modern schedulers (e.g., Airflow/Prefect) and containerized jobs.
- Storage & formats: Hands-on with object storage (e.g., S3), columnar formats/partitioning, and performance tuning.
- Versioning & lineage: Experience with dataset versioning and reproducibility (e.g., DVC/lakeFS/Delta or equivalents).
- Quality at scale: Deduplication, schema/label checks, and automated QC gates in CI.
- Security & privacy: IAM, access controls, and privacy-aware workflows suitable for regulated customers.
- Domain awareness: Familiarity with digital forensics, misinformation threats, or synthetic media, and a willingness to deepen expertise.
- Flexibility: Comfortable moving between data engineering, infra, and tooling tasks when needed.
- Mindset & delivery: Thrive in a fast-moving environment; proactive problem-solver; ship, measure, simplify.
- Communication: Excellent written and verbal skills; explain complex ideas clearly.
- Independence: Deliver quality work on time without constant oversight.

Nice-to-haves
- Streaming & events: Kafka/Kinesis or similar for near-real-time ingestion.
- Vector search: Experience with embedding stores or similarity search at scale.
- Synthetic data: Building pipelines to generate and stress-test rare scenarios.
- Cloud & on-prem: Terraform/CDK, Kubernetes, and hybrid/on-prem data deployments.
- FinOps: Cost monitoring and optimization for data workloads.
- Technical track record: Strong GitHub, open-source contributions, publications, patents, or public talks.
- Leadership: Mentoring and guiding technical direction.
- Dutch language: Fluency is a plus.

What success looks like
- A unified schema and catalog with key datasets onboarded, versioned, and reproducibly built via one command.
- Automated QC gates (dedup/validation) with a red/amber/green dataset health dashboard and clear lineage.
- Fast sampling/curation tools for the ML team, plus cost controls (storage layouts, lifecycle policies) in place.
- Data migration: Inventory and migrate existing/legacy datasets into the new platform; reformat to the new schema, backfill metadata, validate checksums/lineage, and deprecate legacy paths with a rollback plan.

What we offer
- Own the backbone: Define schemas, lineage, and dataset versioning used across research and production.
- Company participation: Meaningful equity/virtual shares aligned with company growth.
- Flexible work: Hybrid (Delft), flexible hours, minimal ceremony, async-first collaboration.
- Data platform mandate: Real say in stack choices (orchestration, catalog, storage/layout) and time to implement them right.
- Repro & auditability: Space to enforce deterministic builds, splits, and traceable lineage; no heroics needed.
- Quality culture: Backing to implement dedup, drift/coverage checks, and dataset health dashboards org-wide.
- FinOps mindset: Budget and support to balance speed, reliability, and total cost.
- Pragmatic on-call: Planned, compensated rotation with automation-first recovery and rollback plans.
- Growth path: IC track to Staff/Principal; opportunities to mentor and codify data standards.
- Learning budget: Annual budget for courses/books, plus two data/ML-infra conferences per year.
- Home office: Modest stipend for an ergonomic setup; commuting support (public transport or mileage).
- Relocation + visa: Visa sponsorship and relocation support for internationals.

Join us and be part of a company committed to creating a more secure and trustworthy digital future. Apply today to become part of our mission-driven team!

