The leading provider of digital identity verification and fraud solutions. Salesinfo@socure.com
Senior Data Scientist – Big Data R&D, Identity Graph, KYC
Location
California
Posted
51 days ago
Salary
$140K - $170K / year
Seniority
Senior
Job Description
Socure
• Own the design, development, and evaluation of machine learning, statistical, and graph-based algorithms for entity-resolution, identity trust scoring, and anomaly detection on massive datasets.
• Architect and optimize graph-based identity representations (identity graph structure, linkage rules, clustering) to improve match rates, reduce false positives/negatives, and support downstream fraud and KYC models.
• Build and maintain scalable data pipelines and feature stores in Spark/PySpark (or Scala), including data normalization, deduplication, and feature computation across large PII datasets in AWS/Databricks environments.
• Lead A/B tests and offline/online experimentation for new models, features, and data sources; define success metrics, design experiments, and ensure rigorous validation before rollout.
• Evaluate new internal and external data sources: explore signal quality, design backtests, quantify incremental value, and provide clear recommendations on vendor selection and integration.
• Partner closely with product managers and engineers to translate ambiguous business and regulatory requirements (e.g., KYC coverage, watchlist matching) into concrete modeling and data roadmaps.
• Provide deep analytical support to Socure’s compliance and regulatory product suite, including investigative analyses, root‑cause analysis for anomalies, and clear narratives for internal and external stakeholders.
• Contribute to model governance and documentation: clearly explain model logic, data dependencies, limitations, and monitoring plans to internal risk/compliance stakeholders.
• Mentor junior data scientists and engineers on best practices in data exploration, feature engineering, experimentation, and code quality.
• Communicate complex technical concepts and trade‑offs in a concise, structured way to both technical and non‑technical audiences (e.g., product reviews, customer meetings, internal briefings).
Job Requirements
- Master’s degree with 3+ years of relevant industry experience, or Ph.D. with 1+ years of experience in applied ML / data science roles; background in Computer Science, Statistics, Mathematics, or related quantitative fields preferred.
- Strong proficiency in Python (preferred) or Scala, including experience with ML libraries such as scikit‑learn, XGBoost, TensorFlow or PyTorch.
- Extensive experience with Spark or PySpark and distributed data systems (e.g., AWS EMR, Databricks) working on very large, messy datasets.
- Deep understanding of supervised and unsupervised learning, feature engineering, model evaluation, and experiment design (A/B testing, holdout strategies, stratification).
- Experience developing production-quality data pipelines and automated workflows using Airflow or similar orchestration tools.
- Practical familiarity with graph databases and/or graph frameworks (Neo4j, AWS Neptune, GraphFrames, DGL, PyTorch Geometric) and graph algorithms for clustering, link prediction, and community detection is strongly preferred.
- Solid SQL skills and experience working with large-scale analytical data stores.
- Experience in at least one of: identity verification, fraud detection, credit risk, or adjacent high‑stakes domains is a plus.
- Demonstrated ability to lead medium‑to‑large projects end‑to‑end, make sound trade‑off decisions under ambiguity, and influence cross‑functional stakeholders with data and clear reasoning.
Benefits
- Offers Equity
- Offers Bonus
More Data Scientist Jobs
• Lead cross-functional initiatives to define, implement, and iterate on measurement and analysis of our Virtual Power Plant (VPP).
• Proactively analyze and interpret complex data to uncover critical business, product, and user insights. Propose strategies for improving both the systems powering our VPP and the tools used by our energy industry partners.
• Synthesize large-scale, disparate datasets (including smart meter data, device telemetry, settlement data, and logs) to aid in increasing the impact of our VPP products and services.
• Build source of truth data models in DBT that serve as the foundation for reporting and decision-making.
• Define and monitor KPIs to evaluate the success of new product features and business initiatives.
• Influence product and business strategy by effectively communicating analytical results, recommendations, and tradeoffs to stakeholders across all levels of the organization.
• Partner with engineers to define appropriate data instrumentation for new product features.
• Perform statistical analysis and build predictive models on user interactions and smart device data to drive business recommendations and product strategy.
• Contribute to the culture and workflows of the Product Analytics team – advocate for analytic best practices, introduce new tools and processes, perform code reviews, support your peers, and coach more junior members.
• Coordinate conversion activities, translate and document conversion requirements and detailed plans.
• Drive execution tasks with various team members and stakeholders, track progress, and escalate issues.
• Follow project management processes and best practices to ensure all tasks, activities, and resources are aligned to meet the overall program and organizational goals.
• Work cross-functionally to complete the Build Phase and support system configuration design, testing, migration, and hyper-care activities.
• Complete conversion functional specification documents, data mapping, cross-reference documents, and data model documents across multiple deployments.
• Drive conversion progress, including task completion and milestone achievement, to adhere to the overall timeline.
• Lead meetings with cross-functional thread leads.
• Understand team member workload and potential resource constraints, and ensure optimal resource utilization throughout the conversion process.
• Proactive risk management: identify, assess, mitigate, and monitor potential risks with real-time alerts for critical issues.
• Monitor and track conversion and deviation issue resolution.
• Communicate expectations and instill accountability in team members.
• Resolve conflicts, promote work sharing, and motivate teams toward common goals.
• Manage multiple conversion activities, tasks, and resources through effective organization, prioritization, and time management practices to meet program objectives.
• Additional responsibilities include documentation, profiling analytics, reporting, internal stakeholder communication, and identifying areas for improvement to enhance quality and efficiency of cutover activities.
Senior Data Scientist Specialist
US Foods
US Foods is a foodservice distributor, partnering with restaurants and operators to help their businesses succeed.
Role Description
The Sr Data Scientist executes statistical and mathematical analyses to support business decision-making. This role supports artificial intelligence/machine learning (AI/ML) in a deployed production environment. They are experts in the field of Data Science. Consequently, the Sr Data Scientist proactively works with cross-functional teams to seek and understand the business/domain context of potential problems that may be solved using ML. They present data science solutions to stakeholders in an iterative fashion. The Sr Data Scientist works with Associate Data Scientists and Data Scientists in converting a business problem into an ML problem and kickstarting foundational work including data exploration, feature engineering, basic hypothesis testing, and creation of baseline models. They create the model feasibility document in close collaboration with the Data Scientists and Sr Data Scientists. The Sr Data Scientist may have direct reports including Associate Data Scientists and Data Scientists. This position is remote, which means the work can be completed from anywhere in the United States except Hawaii or United States Territories.
Essential Duties and Responsibilities
- Work with Associate Data Scientists and Data Scientists in researching data sources, data cleansing, feature engineering, and creating baseline models.
- Independently manage multiple data science model development projects.
- Collaborate and engage with various cross-functional stakeholders to ensure close coordination between data science and business stakeholders including the product management team.
- Take full ownership of model development from inception to implementation, including the machine learning pipeline, documentation, model drift, and remodeling strategy.
- Mentor the Associate Data Scientists and Data Scientists in data science knowledge and their interactions with other stakeholders.
- Coordinate with the ML team, anticipating deployment issues upfront, setting the stage for resource requirements, and ensuring the success of model deployment and retraining.
- Take and delegate ownership of critical tickets or issues faced by business stakeholders as assigned by the Lead Data Scientist or Director of Data Science.
- Read and independently navigate through an extensive codebase with a focus on troubleshooting, debugging, and/or improvement of code.
- Contribute to coding standards and coding conventions and proactively participate in enforcing sound coding practices across the data science team.
- Create and maintain documentation of the ML models, data sources, and methodology.
- Assist in the creation of feasibility documents.
- Proactively participate in the core team processes including R&D, recommending, and implementing process improvements or new methodologies.
- Share knowledge with various other teams and members by presenting at periodic internal team sessions/workshops.
- Guide usage of development methodologies (Scrum, Kanban, etc.) and assist in their implementation in the group projects by deciphering details including features, tasks, estimates of efforts and timing, etc.
Relationships
- Internal: Associate Data Scientists, Data Scientists, Lead Data Scientists, Principal Data Scientists, and Director of DS.
Work Environment
- Remote
Qualifications
- Master’s degree in data science/mathematics/CS/statistics or similar field and 3+ years of experience OR Bachelor's degree and 6+ years of experience in data science.
- Effective written, verbal, and interpersonal communication skills, with the ability to work and communicate effectively with team members and cross-functional teams.
- An expert-level understanding of the following topics:
  - Classification
  - Regression
  - Clustering
  - Dimensionality reduction
  - Association rule learning
  - Bagging (e.g., Random Forests)
  - Boosting (e.g., AdaBoost, Gradient Boosting)
  - Neural Networks and Deep Learning
  - Model Evaluation and Metrics
  - Cross-validation
  - Hyperparameter Tuning
  - Optimization Algorithms
  - NLP and more
- Working knowledge of Agile/SCRUM methodologies with 1-2 years of experience and a practical understanding of the epics, features, and tasks that may be most relevant and how much effort is required for them.
- Sufficient intellectual curiosity to investigate the latest developments in the field (for example LLM/Generative AI models).
- Ability to see the big picture and anticipate issues in terms of deployment limitations, data latency, appropriateness of a particular implementation of algorithms, and mathematical details/assumptions.
- At least 1 year of working with a major cloud platform, creating ML models using frameworks from data exploration to deployment.
- Working knowledge in handling large datasets in a cloud environment.
- Working knowledge of how APIs work, CI/CD (Continuous Integration/Continuous Deployment).
- Understanding of DevOps and MLOps principles.
- 1+ years of writing production-grade code in scripting languages including Bash, Python, and SQL with the ability to identify the areas of improvement, optimal choice of data structures, optimization of the code, and creating custom code to address unique situations.
- Proven, independent ability to navigate a data lake environment while creating training datasets or feature stores.
- Effective skills in creating, deploying, and troubleshooting containers in the context of ML modeling.
- 1+ years of experience with version control systems/concepts, such as Git, for tracking changes in machine learning projects.
- Ability to investigate/research ML frameworks/algorithms, datasets, feature stores, and contribute to solving data science problems.
- Capability of providing macro-level thought leadership to the junior members including the Data Scientists regarding best practices, issues, causes, and respective resolutions of those issues.
Benefits
- Health insurance
- Pre-tax spending accounts
- Retirement benefits
- Paid time off
- Short-term and long-term disability
- Employee stock purchase plan
- Life insurance
Compensation
The expected base rate for this role is between $85,000 and $145,000. Compensation depends on relevant experience and/or education, specific skills, function, geographic location, and other factors as applicable by law (for example: state or local minimum wage thresholds).
Equal Opportunity Employer
EOE – Race/Color/Religion/Sex/Sexual Orientation/Gender Identity/National Origin/Age/Genetic Information/Protected Veteran/Disability Status
Role Description
The Senior Curation Scientist supports Labcorp’s innovative oncology data initiatives by transforming complex oncology testing data into high‑quality, usable enterprise data assets. This role combines biomedical expertise with data curation, ontology development, and knowledge management to enable advanced analytics and data‑driven clinical insights. The position works within a collaborative, cross‑functional environment alongside software engineers, clinicians, and data consumers to support Labcorp’s enterprise oncology data strategy.
Job Responsibilities
- Collaborate with members of the Oncology Data Curation team and cross‑functional partners to design data structures in JSON and relational formats that reflect the complexity of oncology testing data.
- Interpret oncology diagnostic test results to accurately map them into enterprise data models.
- Develop mappings between clinical test results and enterprise data warehouse structures to enable large‑scale data integration.
- Apply and maintain clinical data standards including ICD‑O‑3, FHIR, mCODE, LOINC, ICD‑10, UMLS, SNOMED, and related standards as appropriate.
- Standardize disease and biomarker data using controlled biomedical vocabularies and classifications.
- Expand and maintain biomedical ontologies to support evolving oncology knowledge.
- Apply natural language processing techniques to extract structured information from clinical test results.
- Curate, analyze, harmonize, and manage large oncology data sets while ensuring high data quality standards.
- Partner with engineering teams and participate in the full development lifecycle, from research and prototyping through production deployment.
- Support data quality, governance, and continuous improvement initiatives across oncology data platforms.
Qualifications
- Bachelor’s degree in Applied Mathematics, Statistics, Biochemistry, or a related discipline with 5 or more years of relevant experience; or Master’s degree with 3 or more years of relevant experience.
- 1 or more years of demonstrated experience in bioinformatics, information integration, or knowledge modeling.
- 1 or more years of experience working with and processing next‑generation sequencing (NGS) datasets and databases.
- 1 or more years of experience using general bioinformatics resources such as NCBI Entrez Gene, PubMed, OMIM, or ClinVar.
Preferred Qualifications
- PhD with 2 or more years of experience.
Requirements
- Familiarity with clinical data standards such as HL7 and FHIR.
- Understanding of cancer biology and oncology therapies.
- Experience curating biomedical data and integrating terminologies for ontology development.
- Experience working with relational databases and platforms such as Databricks or SAP HANA.
- Proficiency in Python or Tcl scripting and SQL query development.
- Familiarity with tools such as Jira, Confluence, and version control systems (Git).
- Experience integrating public and proprietary biomedical information sources at scale.
- Exposure to scientific natural language processing or text‑mining technologies.
- Strong attention to detail with excellent organizational, analytical, and communication skills.
- Ability to collaborate effectively across multidisciplinary project teams with a strong sense of ownership.
- Ability to demonstrate relevant research expertise through publications, presentations, software tools, or applied knowledge.
Benefits
- Employees regularly scheduled to work 20 or more hours per week are eligible for comprehensive benefits including: Medical, Dental, Vision, Life, STD/LTD, 401(k), Paid Time Off (PTO) or Flexible Time Off (FTO), Tuition Reimbursement and Employee Stock Purchase Plan.
- Employees regularly scheduled to work fewer than 20 hours, Casual, Intern, and Temporary employees are only eligible to participate in the 401(k) Plan.
- Employees who are regularly scheduled to work a 7 on/7 off schedule are eligible to receive all the foregoing benefits except PTO or FTO.




