Hi all,
In our final theory lunch of the quarter (!), Ofir will present "Sampling Sketches for Concave Sublinear Functions of Frequencies." As always, please join us Thursday from noon to 1 pm in Gates 463A!
----------------------------------------
Abstract:
We consider datasets that consist of elements that are key-value pairs, and our goal is to compute estimates of statistics or aggregates over the data. The contribution of each key is weighted by a function of its frequency (sum of values of its elements). This fundamental problem has a wealth of applications in data analytics and machine learning. A common approach is to maintain a sample and compute the statistics using the sample. One simple way to compute such a sample is to first aggregate the raw data to produce a table of keys and their frequencies and then apply a weighted sampling scheme. This aggregation however is too costly on massive distributed datasets with a large number of distinct keys.
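For readers who want something concrete, the aggregate-then-sample baseline described above can be sketched in a few lines of Python. This is only an illustration of the costly baseline, not the sketches from the talk. The names `aggregate`, `cap`, and `weighted_sample`, the threshold `T=5.0`, and the toy data are all assumptions made up for this example, and sampling is done with replacement purely for simplicity.

```python
import random
from collections import defaultdict


def aggregate(elements):
    """Sum the values of all elements sharing a key (the key's frequency)."""
    freq = defaultdict(float)
    for key, value in elements:
        freq[key] += value
    return dict(freq)


def cap(x, T=5.0):
    """Hypothetical choice of frequency function: cap the frequency at T."""
    return min(x, T)


def weighted_sample(freq, f, k, rng):
    """Draw k keys (with replacement) with probability proportional to f(frequency)."""
    keys = list(freq)
    weights = [f(freq[key]) for key in keys]
    return rng.choices(keys, weights=weights, k=k)


# Toy data: (key, value) elements, assumed for illustration only.
elements = [("a", 1), ("b", 2), ("a", 3), ("c", 1), ("a", 6)]
freq = aggregate(elements)  # frequencies: a -> 10, b -> 2, c -> 1
sample = weighted_sample(freq, cap, k=3, rng=random.Random(0))
```

The aggregation step needs one table entry per distinct key, which is the cost the abstract says becomes prohibitive on massive distributed datasets.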
An ideal sampling scheme, which allows for low-variance estimates, samples keys with probabilities proportional to their contributions. These probabilities depend on the function that is applied to the frequency of each key to compute its contribution. Our main contribution is the design of composable sampling sketches that can be tailored to any concave sublinear function of the frequencies and provide statistical guarantees on estimation quality that are very close to that of an ideal sample computed over aggregated data. Concave sublinear functions are commonly used to mitigate the disproportionate effect of keys with high frequency, and include capping functions min{x,T} (for a constant T), the moments x^p for 0
Reminder: Saeed Seddighin's talk is today at 3:00 PM in Gates 463A.
________________________________
From: Ofir Geri
Sent: Monday, December 3, 2018 12:35:35 PM
To: thseminar at cs.stanford.edu
Subject: Two Theory Seminars This Week: Saeed Seddighin on 12/5 and Sam Hopkins on 12/7
Hi all,
This week we will have two theory seminars:
1. Saeed Seddighin (University of Maryland) on Wednesday 12/5, 3:00 PM in Gates 463A.
2. Sam Hopkins (UC Berkeley) on Friday 12/7, 3:00 PM in Gates 392.
Please see the abstracts below. If you are interested in meeting with Sam Hopkins, please email Mary at marykw at stanford.edu
Hope to see you there!
Ofir
Fast and Parallel Algorithms for Edit Distance and Longest Common Subsequence
Wednesday 12/5, 3:00 PM, Gates 463A
Speaker: Saeed Seddighin (University of Maryland)
String similarity measures are among the most fundamental problems in computer science. Notable examples are edit distance (ED) and longest common subsequence (LCS). These problems find applications in various contexts such as computational biology, text processing, compiler optimization, data analysis, image analysis, etc. In this talk, I'll present fast and parallel algorithms for both problems. In the first part of my talk, I will present an algorithm for approximating edit distance within a constant factor in truly subquadratic time. This question has been open for three decades, and only recently were we able to answer it positively.
In the second part of my talk, I will present MPC algorithms for both edit distance and longest common subsequence. These algorithms can be seen as extensions of the previous ideas to the MPC model. The algorithms are optimal with respect to round complexity, time complexity, and approximation factor.
Mean Estimation with Sub-Gaussian Rates in Polynomial Time
Friday 12/7, 3:00 PM, Gates 392
Speaker: Sam Hopkins (UC Berkeley)
We study polynomial-time algorithms for a fundamental statistics problem: estimating the mean of a random vector from i.i.d. samples. Focusing on the heavy-tailed case, we assume only that the random vector X has finite mean and covariance. In this setting, the radius of the confidence intervals achieved by the empirical mean is large compared to the case where X is Gaussian or sub-Gaussian. On the other hand, estimators based on high-dimensional medians can achieve tighter confidence intervals, at the cost of potential computational intractability.
We offer the first polynomial-time algorithm to estimate the mean with sub-Gaussian-size confidence intervals under such mild assumptions. Our algorithm is based on a new semidefinite programming relaxation of a high-dimensional median. Previous estimators that assume only the existence of finitely many moments of X either sacrifice sub-Gaussian performance or are only known to be computable via brute-force search procedures requiring time exponential in the dimension.
Hi everyone,
For the next TCS+ talk*, and the last of the Fall season, Julia Chuzhoy will
be speaking about "Almost Polynomial Hardness of Node-Disjoint Paths
in Grids."
Come next Wednesday (12th) at 10am (actually, come at 9:55 for
breakfast) to see it!
Best,
-- Clément
* for the people in the back: this is an online, interactive talk we can
all watch from Gates while sipping coffee and asking questions to the
speaker.
-------------------------------
Speaker: Julia Chuzhoy (TTIC)
Title: Almost Polynomial Hardness of Node-Disjoint Paths in Grids
Abstract: In the classical Node-Disjoint Paths (NDP) problem, we are
given an n-vertex graph G, and a collection of pairs of its vertices,
called demand pairs. The goal is to route as many of the demand pairs as
possible, where to route a pair we need to select a path connecting it,
so that all selected paths are disjoint in their vertices.
The best current algorithm for NDP achieves an
$O(\sqrt{n})$-approximation, while, until recently, the best negative
result was a roughly $\Omega(\sqrt{\log n})$-hardness of approximation.
Recently, an improved $2^{\Omega(\sqrt{\log n})}$-hardness of
approximation for NDP was shown, even if the underlying graph is a
subgraph of a grid graph, and all source vertices lie on the boundary of
the grid. Unfortunately, this result does not extend to grid graphs.
The approximability of NDP in grids has remained a tantalizing open
question, with the best upper bound of $\tilde{O}(n^{1/4})$, and the
best lower bound of APX-hardness. In this talk we come close to
resolving this question, by showing an almost polynomial hardness of
approximation for NDP in grid graphs.
Our hardness proof performs a reduction from the 3COL(5) problem to NDP,
using a new graph partitioning problem as a proxy. Unlike the more
standard approach of employing Karp reductions to prove hardness of
approximation, our proof is a Cook-type reduction, where, given an
input instance of 3COL(5), we produce a large number of instances of
NDP, and apply an approximation algorithm for NDP to each of them. The
construction of each new instance of NDP crucially depends on the
solutions to the previous instances that were found by the approximation
algorithm.
Joint work with David H.K. Kim and Rachit Nimavat.
Hi all,
Some of you might be interested in this conference at Stanford starting
December 13: https://sites.google.com/view/amirdembo60/
Best,
Mary
