Project Overview
Pandas is a highly popular tool for data science and AI, boasting millions of dedicated users. With its arsenal of over 600 functions, pandas empowers data scientists to efficiently clean, transform, summarize, and featurize data. However, pandas struggles with larger datasets, causing out-of-memory errors and slow performance. As datasets scale, the only viable alternative is to turn to "big data" frameworks like database systems or Spark. Nevertheless, these technologies come with a steep learning curve, particularly for users without a computer science background, necessitating months, if not years, to master. As a result, many data teams continue to prefer using pandas rather than database systems or Spark.

During my time at Ponder, I led the development and launch of Ponder's B2B platform, a technology that allows data teams to run pandas anywhere at scale, whether on their laptops, on clusters, or in the cloud. In particular, Ponder allows data teams to run their pandas code directly in their data warehouse, be it Snowflake, BigQuery, or Redshift. Our MVP launch attracted a lot of community attention and led to the acquisition of early adopters, POC customers, and paying customers.
My Role

  • 01. Conducted interviews with data teams around the world to gain insights and inform the product strategy, leading to the identification of the right product-market fit.
  • 02. Prioritized customer needs and converted them into actionable, engineering-facing user stories.
  • 03. Worked closely with the engineering, marketing, sales, and leadership teams to define the product roadmap.
  • 04. Set data-driven KPIs and leveraged user data to drive dynamic marketing and customer acquisition decisions.
  • 05. Conducted dogfooding sessions and soft launches as part of the product planning process to ensure that the product is fully tested and ready for a wider audience prior to its official launch.
  • 06. Created and maintained product documentation.
  • 07. Presented the product strategy and roadmap to leadership.

Teams Involved

Engineering, Marketing, Sales, Leadership

...

Customer and Market Research

After conducting 100 interviews with individual data enthusiasts and data teams, as well as hundreds of surveys, we uncovered that many data teams in small and large organizations frequently face challenges when it comes to scaling their pandas workflows. Furthermore, our research yielded valuable insights into the needs and preferences of potential customers, allowing us to develop a tailored strategy that aligns with their unique requirements.

Customer Segmentation

One of the main outcomes of our customer research was a better understanding of the customer landscape and its segments. In particular, we segmented our customers into two distinct groups: 1) data teams in organizations and 2) individual data enthusiasts.

Customer Pain Points
Segment Prioritization

Our research revealed that the market for individual data enthusiasts is significantly larger than the market for data teams in organizations. Furthermore, in our discussions with many data enthusiasts, we discovered that they had a strong preference for using pandas for their data analysis. However, we found that this group of customers is not a great fit for our product because they typically don't work with big data. Thus, they rarely need a technology that enables them to scale their workflows.

On the other hand, data teams in organizations were the ones that faced many issues scaling their pandas workflows because they often deal with big data. Moreover, they expressed their desire to use our proposed solution on a daily basis and pay for it, as they believe our product can effectively address their current challenges with big data. Based on these findings, we have decided to shift our focus to target data teams in organizations as our primary customer segment.

Competitive analysis of well-known tools used for scaling pandas

I conducted a competitive analysis of well-known libraries and technologies that try to add more power to pandas in various ways. This helped me and the rest of the team understand the strengths and weaknesses of each tool and how they stack up against each other.

Key takeaways from my competitive analysis
In summary, we learned that there is a lack of platforms that enable users to scale their pandas workflows without having to manage computing clusters or learn a new syntax.

                            | Mars                | Polars                     | Modin                      | Vaex
Pricing                     | Open Source         | Open Source                | Open Source                | Custom Pricing
Target Users                | Anyone              | Anyone                     | Anyone                     | Data Teams
Syntax Similarity to pandas | Drop-in replacement | Different Syntax           | Drop-in replacement        | Different Syntax
Coverage                    | Medium              | Large                      | Large                      | Large
Ease of Use                 | Easy to use         | Not too easy to learn      | Easy to use                | Hard to learn
Limitations                 | Not complete        | New Syntax, learning curve | Uses Dask or Ray as engine | New Syntax, learning curve
User Persona
To gain a deeper understanding of our target audience's goals, behaviors, motivations, pain points, and preferences, I created multiple user personas. These personas helped the team design features that meet users' needs, expectations, and desires. They also helped me communicate with stakeholders by providing a shared understanding of the target users.

...

Product Development
User Stories & Acceptance Criteria
User Story | Acceptance Criteria
As a data scientist, I would expect your syntax to be identical to that of pandas.
  • 01. Our marketing, sales, and documentation materials should emphasize that our solution is a drop-in replacement for pandas.

As a data engineer, I prefer to use your solution in my own development environment rather than using it solely in notebooks.
  • 01. Enable customers to install the ponder package on their own machines with pip (pip install ponder) without having to rely on our UI.

As a data engineer, I never want my data to leave where it lives.
  • 01. Ensure the entire computation is pushed down to the data warehouse.

  • 02. Avoid executing queries in memory.

As an analyst, I want to monitor long-running queries that may be consuming excessive memory and computation resources.
  • 01. Provide a monitoring dashboard to identify any long-running queries that may be consuming excessive resources.

  • 02. Provide automated notifications that alert users when they are running expensive queries.

As a data analyst, it is essential for me to be able to access and extract data from various sources, including Snowflake, BigQuery, and S3 buckets.
  • 01. Support connection and access to multiple data sources at the same time.

  • 02. Support I/O APIs for different resources and in various formats (e.g., csv, parquet).
As a data analyst, I use common libraries such as Matplotlib, Seaborn, NumPy, scikit-learn in my pipelines.
  • 01. To ensure end-to-end support for data science pipelines, we need to provide support for other popular libraries, including Matplotlib, Seaborn, NumPy, and scikit-learn.
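
The acceptance criterion above about pushing the entire computation down to the data warehouse can be illustrated with a minimal sketch. This is not Ponder's implementation: it uses Python's built-in sqlite3 module as a stand-in for a warehouse such as Snowflake, BigQuery, or Redshift, and the orders table and all function names are hypothetical. It only contrasts pulling every row out of the database to aggregate it locally with letting the database aggregate and return the summary.

```python
import sqlite3


def make_warehouse(n_rows=10_000):
    """Build a stand-in "warehouse": an in-memory SQLite database.

    Assumption for this sketch only; the table and columns are hypothetical.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    rows = [("east" if i % 2 == 0 else "west", float(i)) for i in range(n_rows)]
    con.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    con.commit()
    return con


def total_by_region_client_side(con):
    """Without pushdown: every row leaves the database and is summed locally."""
    totals = {}
    rows_transferred = 0
    for region, amount in con.execute("SELECT region, amount FROM orders"):
        totals[region] = totals.get(region, 0.0) + amount
        rows_transferred += 1
    return totals, rows_transferred


def total_by_region_pushed_down(con):
    """With pushdown: the database aggregates; only summary rows leave it."""
    result = con.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region"
    ).fetchall()
    return dict(result), len(result)


if __name__ == "__main__":
    con = make_warehouse()
    local_totals, rows_local = total_by_region_client_side(con)
    pushed_totals, rows_pushed = total_by_region_pushed_down(con)
    # Both approaches agree on the answer; they differ in how much data moves.
    print(local_totals == pushed_totals, rows_local, rows_pushed)
```

Both functions return the same totals, but the pushed-down version transfers one row per region instead of one row per order, which is the property the "data never leaves where it lives" story asks for.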

Risk & Mitigation
Risks | Mitigation
Running over budget due to unforeseen expenses or poor cost estimates.
  • 01. Create a buffer in financial planning.

  • 02. Prioritize the most essential features to ensure that resources are allocated appropriately.

Failing to generate revenue due to poor marketing or business decisions.
  • 01. Regularly conduct market research to analyze the market interest in adopting our technology.

  • 02. Be ready to change product strategy or customer segment if needed.

Running behind schedule due to unforeseen delays or poor time or resource management.
  • 01. Schedule daily standups for the engineering team.

  • 02. Schedule a weekly meeting for cross-functional teams to attend.

  • 03. Break development into smaller and more manageable tasks to help ensure that progress is made on a consistent basis.
Technical issues or bugs that prevent the product from functioning properly or negatively impact the customer experience.
  • 01. Ensure that we have a robust testing process in place to detect and address any technical issues or bugs before our product is released to customers.

  • 02. Offer comprehensive customer support to help customers address any issues or bugs they may encounter.
The risk of unauthorized access to sensitive data, resulting in data theft or loss.
  • 01. Use robust data encryption technologies, multi-factor authentication, and secure hosting environments to protect user data. Regularly test and audit the security of the system and adopt the best security practices.
Competitors may reverse engineer the software product code, potentially leading to intellectual property theft, loss of revenue, and security vulnerabilities.
  • 01. Implement software obfuscation techniques and legal protections, limit access to the code, and monitor for unauthorized use.

Product Roadmap
Timelines and Milestones

I then collaborated with the engineering team to create a product timeline to guide the development and delivery of the product to market. Timelines and milestones are important aspects of a product roadmap because they give the product development process a clear structure.

Tracking Timelines and Milestones

I then created a Jira board to track our progress, manage our backlog of work items, and collaborate with each other. We also used Jira for creating and assigning bugs/issues.

...

System Design Overview

...

Go-to-Market Strategy
Messaging
I collaborated with the marketing team to turn what we learned from our market research into market messaging.
Product Messaging
In collaboration with the marketing and sales teams, I also worked on the marketing strategy funnel.
MVP Launch
MVP Documentation

I was responsible for both creating and maintaining the product documentation using Sphinx, an open-source documentation generator. This involved writing clear and concise documentation in the reStructuredText markup language and updating it regularly to keep it accurate and relevant.

MVP Design

Below, I have provided a series of screenshots to illustrate the step-by-step process that users follow to work with the Ponder platform.

Users who prefer to use Ponder in their own development environment can install it by running pip install ponder in that environment.

Key Takeaways