[2024.11] Data Umbrella Newsletter: November 2024
We organize data science events for the community.
Data Umbrella is a non-profit global community for underrepresented persons in data science. We organize online data science events for the community. All levels are welcome. Our Code of Conduct applies to all of our spaces.
Announcements
Community News
NASA Funds Open-Source Software Underpinning Scientific Innovation
NASA has awarded $15.6 million in grant funding to 15 projects supporting the maintenance of open-source tools, frameworks, and libraries used by the NASA science community, for the benefit of all.
The Open Collective Platform is moving to a community governed non-profit!
A group of fiscal hosts representing thousands of collectives have created a new independent, community-governed, non-profit organization and have reached an agreement with Open Collective Inc. to take over the Open Collective platform as it exists today.
Black Women in Data Conference 2024
Data Umbrella is honored to be a Community Partner for the 2024 “Black Women in Data” conference. Join their waitlist for 2025 information.
Follow DataedX Group, the lead organization behind BWID, on LinkedIn to get updates on upcoming events.
The Impact of Toxic Influencers on Communities
There’s a common character that can often be spotted in many online communities. It’s the “toxic influencer” (sometimes referred to as an “intellectual bully”). A toxic influencer is a person who fills a useful role in a community, but at the same time creates a toxic environment around them that can become bullying in some instances. Yet they never overstep the line into proper abuse, so they can’t be easily managed.
Erin 'Folletto' Casali explores toxic influencers and their impact on online communities.
Open Science "Dynamic Convergence" Workshop Final Report
In collaboration with CERN and UNESCO and with the participation of NASA, the Open Research Community Accelerator (ORCA) was pleased to host 90 researchers, students, policymakers, funders, and others from across 30 countries at a two-day event in Washington, DC.
The event, held September 18-19, 2024, highlighted impactful open science activities, explored collaboration opportunities, and identified practical ways to speed up the global adoption of open science. More details about the workshop, including presentations and videos, may be found in this summary report.
Community Partnerships
NumHack 2024 (Online Hackathon) (Nov 22 - 24)
PyData Global Impact Scholarship group (Impact Hackathon Committee), under the NumFOCUS umbrella, is organizing an online, AI-focused hackathon in November 2024.
The Hackathon is open to anybody who registers. Teams of at least 3 members are encouraged, and opportunities to form teams two weeks prior to the Hackathon will be announced. Register here.
PyData Global 2024 (online conference) (Dec 3-5)
PyData Global 2024 is a 3-day virtual event for the international community of data scientists, data engineers, and developers of data analysis tools to share ideas and learn from each other. Register now; discount tickets are available.
PyLadiesCon 2024 (online conference) (Dec 6-8)
Attention all PyLadies community members! We’re excited to share that we are in the early stages of planning a PyLadies Conference (PyLadiesCon), a transformative event designed to promote diversity, learning, and empowerment within the Python community. 🎉
Save the date! The conference will take place on December 6th-8th, where we’ll gather together for a weekend filled with insightful talks, engaging panels, and collaborative networking opportunities.
The WordPress vs. WP Engine drama, explained
The world of WordPress, one of the most popular technologies for creating and hosting websites, is going through a very heated controversy. The core issue is the fight between WordPress founder and Automattic CEO Matt Mullenweg and WP Engine, which hosts websites built on WordPress.
In mid-September, Mullenweg wrote a blog post calling WP Engine a “cancer to WordPress.” He criticized the host for disabling the ability for users to see and track the revision history for every post. Mullenweg believes this feature is at the “core of the user promise of protecting your data” and said that WP Engine turns it off by default to save money.
He also called out WP Engine investor Silver Lake and said they don’t contribute sufficiently to the open source project and that WP Engine’s use of the “WP” brand has confused customers into believing it is part of WordPress.
How do we fund open source?
Considering the ubiquity of open source, regulating its support through public funding appears to be the strongest long-term vision. However, such government-led initiatives are at an early stage. Until public initiatives mature, the ecosystem will need a combination of corporate sponsorship, foundational stewardship, and increased public awareness to ensure its survival.
Source: Serkan Holat on LinkedIn: How do we fund open source?
What can modern topic modelling look like?
Leland McInnes has built an interactive map of topics from 2.4 million papers on the arXiv preprint server, using sentence-transformers + nomic-embed, UMAP, HDBSCAN, Toponymy, and DataMapPlot, to show what modern topic modelling can look like.
Source: Leland McInnes on LinkedIn: What can modern topic modelling look like?
Python 3.9
Now that Python 3.13 has been officially released, Python 3.8 has reached its end of life. Python 3.9 features can now be considered the "standard" across all actively supported versions.
Check out the official Python version support timeline: https://devguide.python.org/versions/
Source: Jeff Triplett (@webology@mastodon.social)
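As a quick illustration of what "Python 3.9 as the baseline" can mean in practice, here is a minimal sketch of three features introduced in 3.9: built-in generic type hints (PEP 585), the dict union operator (PEP 584), and `str.removeprefix` / `str.removesuffix` (PEP 616). The function names and sample values are made up for this example.

```python
# Requires Python 3.9 or later. Helper names and sample data are hypothetical.


def merge_settings(defaults: dict[str, int], overrides: dict[str, int]) -> dict[str, int]:
    """Merge two dicts with the union operator; values from the right side win."""
    # dict[str, int] as a type hint works without importing typing.Dict (PEP 585).
    return defaults | overrides


def strip_tag(name: str) -> str:
    """Drop a fixed prefix and suffix, if present (PEP 616)."""
    return name.removeprefix("v").removesuffix("-final")


settings = merge_settings({"retries": 3, "timeout": 10}, {"timeout": 30})
version = strip_tag("v3.9.0-final")

print(settings)
print(version)
```

Code that relies on these features will not run on Python 3.8, which is why the end of 3.8 support matters for library maintainers.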
Timestamps
CONTRIBUTE TO TIMESTAMPS: We still have about a dozen videos which need timestamps. We have instructions on how you can contribute to this project on GitHub. Help us help the community. Pick a video and get started.
Thank you to community member kljohnson for their contributions to Data Umbrella by adding timestamps to our video Best Practices for Creating a Data Science Team.
Call for Suggestions
Do you have suggestions for future webinar topics or speakers? Would you like to speak on a topic? For these and any other suggestions, please complete our Online Suggestion Box or email us at info@dataumbrella.org.
Call for Speakers
We are looking for speakers on the following topics:
Data Privacy
Data Engineering
Generative AI
Software engineering
Code quality
Email us if you are interested in speaking or have a speaker or topic suggestion: info@dataumbrella.org
Upcoming Events (free & online webinars)
Polars & Narwhals: Understanding Expressions When You're Used to pandas
November 19, 2024
When it comes to dataframes, pandas is the go-to library for many people. Yet Polars is taking the world by storm, and so many data practitioners are curious about trying it out. There is a learning curve though, as Polars introduces some concepts which pandas users might not be familiar with. This talk will be a deep dive into one of those concepts (expressions) and will focus on how you can understand them from a pandas perspective.
The lessons learned will be useful beyond Polars, as they will also enable you to use Narwhals. Narwhals is a lightweight and extensible compatibility layer between dataframe libraries which is gaining traction (Altair, Marimo, scikit-lego, and more are currently using it). Like Polars, its API is based on expressions. By learning this concept, you will not only be able to use Polars efficiently, but you'll also know how to build dataframe-agnostic tools.
Videos
In case you missed our recent events, the videos have been posted. Subscribe to our Data Umbrella YouTube to receive notifications when the videos premiere.
Polars for Data Analysis in Python
Discover Polars, the high-performance DataFrame library revolutionizing data analysis in Python. Built on Rust, Polars offers unparalleled speed and efficiency, outperforming pandas, Dask, and even PySpark. Explore its innovative features like lazy evaluation, memory efficiency, and automatic multi-threading, designed to handle large datasets with ease.
In this session, you'll learn practical techniques for data manipulation and advanced transformations. Kimberly Fessel demonstrates Polars' syntax and capabilities, making it accessible even if you’re new to Polars.
RAGged Edge Box: A Personal AI-Powered Document Search System
One of the most popular embodiments of Generative AI is retrieval-augmented generation (RAG). Such systems use an information retrieval (IR) engine (based on semantic embeddings or keyword search) to find relevant passages and then use a Large Language Model (LLM) to extract answers to a given query. These systems require a large amount of computation and are usually implemented in the cloud, which presents data privacy issues.
In this talk, Pablo presents the RAGged Edge Box project, in which basic embedding systems and small local LLMs are packaged inside a multi-platform virtual machine (VirtualBox). The system provides a Web interface that runs locally and allows access to the RAG functionality in a completely private manner. The neural networks run on an ONNX runtime and do not require a GPU. RAG code is implemented in PHP and is easy to modify, requiring a much smaller execution environment than a Python alternative.
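To make the retrieve-then-generate pattern described above more concrete, here is a small, standard-library-only sketch of the retrieval half using keyword overlap (bag-of-words cosine similarity). It is not the RAGged Edge Box code, which is written in PHP; the function names, sample documents, and prompt wording are all assumptions made for illustration, and the final LLM call is left out.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (keyword search)."""
    q = Counter(tokenize(query))
    ranked = sorted(
        documents, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True
    )
    return ranked[:k]


def build_prompt(query: str, passages: list[str]) -> str:
    """Assemble the text that would be sent to a local LLM (call not shown)."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


# Hypothetical personal documents.
docs = [
    "The invoice for March is due on April 15.",
    "Our cat prefers to sleep on the keyboard.",
    "Tax documents are stored in the blue folder.",
]
question = "When is the March invoice due?"
top = retrieve(question, docs, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

A system based on semantic embeddings would replace the `cosine` over word counts with a similarity over embedding vectors; the overall flow of retrieving passages and then building a prompt stays the same.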
Featured Resources
Video Playlists
Data Umbrella Resources
Visit our blog site: blog.dataumbrella.org, and see articles written by our community members on their experience in recent sprints.
We have a Job Board. You can post jobs for free.
Our Data Umbrella YouTube is growing! Subscribe to our channel to receive notifications of when our event videos are posted.
Accessibility Corner
Accessibility Update: Closed Captioning
Our webinars have closed captioning available! This feature makes our live events more accessible to those with hearing needs and to anyone who likes to see the transcript live during the presentation to fully process information.
Connect with Us
Meetup: Data Umbrella & Data Umbrella Africa (upcoming events)
YouTube (past recorded talks)