Shubham A.

Expert Big Data engineer with Spark, SQL/NoSQL databases, Python, Scala, AWS

Pune, India

Experience: 2 Years

25714.1 USD / Year

  • Notice Period: Days

About Me

  • Currently working at Tapnomy on Big Data operations (Hive, Spark, Kafka, Airflow, MySQL, MongoDB, Cassandra, Scala, Python, Presto, Qubole, EMR, Cloudera, AWS).
  • Working on AWS cloud computing services (EMR, S3, Athena, DynamoDB, etc.)
  • Working experience at Innoplexus as a data engineer:
    • Data streaming using Apache Kafka
    • Data storage in databases (MongoDB, MySQL, Hive, Redis)
    • Data processing using PySpark and Spark with Scala
    • Data analysis using the Pandas library
    • Data crawling using BeautifulSoup and Selenium
    • Proficient in ElasticSearch, MongoDB, Hive, MySQL
    • Proficient in Spark with Scala and PySpark
    • Advanced knowledge of Python and Scala
  • Working experience at Vyom Labs Pvt. Ltd. on backend development.
  • Chief Technical Advisor in Walchand Linux Users' Group (WLUG).


Portfolio Projects

Description

GDPR (General Data Protection Regulation) is a regulation in EU law on data protection and
privacy for all individuals within the European Union (EU) and the European Economic Area (EEA).
Chartboost must also comply for past and incoming data: if a client requests removal of their account,
Chartboost must erase that client's information from its systems within a given amount of time. To comply
with GDPR, we needed an automated solution that can search the entire data warehouse for the requested
device-ids and anonymise the clients' data.
Responsibilities:
○ Single point of contact for GDPR requests from the data team.
○ Design a solution that can handle both simple and massive GDPR requests.
○ Develop a configurable multithreaded environment to handle any kind of request.
○ Add monitors to the ongoing GDPR request process so that an alert is raised when the process fails or is killed.
○ Create an Airflow data pipeline that reports daily status automatically.
○ Collaborate with the product team to build new features and infrastructure.
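The configurable multithreaded scan described above can be sketched in plain Python. This is a hypothetical illustration, not Chartboost's actual code: the names `scan_partition`, `process_gdpr_request`, and the record fields are invented for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch: scan warehouse partitions in parallel for requested
# device-ids and anonymise matching records. All names are illustrative.
ANONYMISED = "REDACTED"

def scan_partition(partition, requested_ids):
    """Anonymise every record whose device_id is in the GDPR request."""
    hits = 0
    for record in partition:
        if record["device_id"] in requested_ids:
            record["device_id"] = ANONYMISED
            record["user_data"] = ANONYMISED
            hits += 1
    return hits

def process_gdpr_request(partitions, requested_ids, max_workers=4):
    """Fan the scan out over partitions with a configurable thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda p: scan_partition(p, requested_ids),
                           partitions)
    return sum(results)  # total number of records anonymised
```

The thread-pool size is the "configurable" knob: a simple request can run with one worker, while a massive backfill over many partitions scales the pool up.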


Description

Chartboost is the largest mobile-games ad monetisation platform in the world. The
Chartboost SDK is the most widely integrated independent mobile ad SDK; through Chartboost Exchange, Ad
Network, and other services, Chartboost helps developers build businesses while connecting advertisers to
highly engaged audiences.
Responsibilities:
○ Design realtime and batch processing pipelines for different systems to consume.
○ Ensure all core data pipelines work as expected; fix any issues that arise.
○ Performed data cleaning, data validation, and scaling using Airflow, Hive, and Spark.
○ Create Airflow data pipelines to solve performance issues and reduce the complexity of daily data usage.
○ Collaborate with the product team to build new features and infrastructure.


Description

The Chartboost data pipeline contains 200+ DAGs. The purpose of the DAG-builder library is to complete
the migration of DAGs from Cloudera to AWS EMR as early as possible and to simplify Airflow DAG creation.
Responsibilities:
○ Design and build the whole library from scratch.
○ Migrate the whole data pipeline from Cloudera to AWS EMR.
○ Convert all Hive jobs to Spark jobs.
○ Optimize Spark job performance and minimize cost.
○ Tune the Spark jobs through the Spark code and EMR cluster configurations.
○ Collaborate with the product team to add new functionality to the library.
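The DAG-builder idea above can be sketched as a config-driven factory: each of the 200+ pipelines is described by a small declarative spec instead of hand-written Airflow code. This is a hypothetical illustration; Airflow's `DAG` and operator classes are stood in for by plain dicts so the sketch stays self-contained.

```python
# Hypothetical sketch of a config-driven DAG builder. A declarative spec is
# expanded into a task graph; in the real library these would be Airflow
# DAG/operator objects rather than dicts.

def build_dag(spec):
    """Expand a declarative spec into tasks with upstream dependencies."""
    tasks = {}
    for name, conf in spec["tasks"].items():
        tasks[name] = {
            "name": name,
            "command": conf["command"],
            "upstream": list(conf.get("depends_on", [])),
        }
    return {
        "dag_id": spec["dag_id"],
        "schedule": spec.get("schedule", "@daily"),
        "tasks": tasks,
    }

# A pipeline is now a few lines of config instead of boilerplate code.
spec = {
    "dag_id": "daily_revenue",
    "schedule": "@daily",
    "tasks": {
        "extract": {"command": "spark-submit extract.py"},
        "aggregate": {"command": "spark-submit aggregate.py",
                      "depends_on": ["extract"]},
    },
}
dag = build_dag(spec)
```

Centralising DAG construction this way also makes a Cloudera-to-EMR migration a change to the builder, not to 200+ individual DAG files.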


Description

Legalplexus provides a platform where users can upload documents such as PDFs, DOCX files, text,
images, and scanned documents. It provides scanned_documents_reader and scanned_documents_editor facilities.
Responsibilities:
○ Worked independently on backend operations and data-load operations.
○ Single point of contact for all backend APIs and data.
○ Design and build an efficient data pipeline.


Description

Company360 is a web portal for the banking sector that provides information about different
organizations and their competitors: products, services, industry, sector, About_us,
sentiment analysis of company news, employee details, company boards, financial data,
announcements, litigations, etc.
Responsibilities:
○ Worked independently on the whole ETL and backend operations.
○ Single point of contact for all backend APIs and data.


Description

Innoplexus provides advanced artificial intelligence (AI) and blockchain solutions that support
all stages of drug development from pipeline to market. Innoplexus identifies and extracts structured and
unstructured life science data by scanning up to 95% of the world-wide-web and merges it with enterprise
and third-party data in an ongoing, real-time process. This continually updated data provides solutions that
serve pharmaceutical companies and the biotech industry. Data volume is 300+ TB.
Responsibilities:
○ Design realtime and batch processing pipelines for different systems to consume.
○ Ensure the data pipeline works as expected; fix any issues that arise.
○ Performed data cleaning, data validation, and scaling using Spark and Pandas.
○ Collaborate with the product team to build new features and infrastructure.
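The cleaning/validation stage mentioned above can be pictured with a minimal sketch. In production this logic would run as Spark or Pandas transformations; plain Python is used here to keep the sketch self-contained, and `clean_records` and the field names are hypothetical.

```python
# Hypothetical sketch of a cleaning/validation pass over crawled records:
# drop malformed rows, normalise fields, and deduplicate by document id.

def clean_records(records):
    """Keep records with a non-empty doc_id and a parsable year, deduped by doc_id."""
    seen, cleaned = set(), []
    for rec in records:
        doc_id = (rec.get("doc_id") or "").strip()
        if not doc_id or doc_id in seen:
            continue  # drop rows with no id, and duplicates
        try:
            year = int(rec.get("year"))
        except (TypeError, ValueError):
            continue  # drop rows that fail type validation
        seen.add(doc_id)
        cleaned.append({"doc_id": doc_id, "year": year})
    return cleaned
```

The same drop/normalise/dedupe steps map directly onto Spark's `filter`, `withColumn`, and `dropDuplicates` when the data no longer fits on one machine.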
