[00:00:48] RECOVERY - dhclient process on mw1139 is OK: PROCS OK: 0 processes with command name dhclient [00:03:24] (03CR) 10Dzahn: Apache redirects for w.wiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/285932 (https://phabricator.wikimedia.org/T108557) (owner: 10Dereckson) [00:04:24] 06Operations, 06Parsing-Team, 06Services, 03Mobile-Content-Service: ChangeProp / RESTBase / Parsoid outage 2016-05-05 - https://phabricator.wikimedia.org/T134537#2268856 (10ssastry) [00:04:32] (03CR) 10Dzahn: [C: 04-1] "seems like we want an internal rewrite instead of an external redirect, per comment from Legoktm. but the R=301 part makes it external" [puppet] - 10https://gerrit.wikimedia.org/r/285932 (https://phabricator.wikimedia.org/T108557) (owner: 10Dereckson) [00:04:55] PROBLEM - nutcracker process on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:05:24] (03CR) 10Dereckson: [C: 04-1] "Planning changes: forget redirects and create a proper vhost block for w.wiki." [puppet] - 10https://gerrit.wikimedia.org/r/285932 (https://phabricator.wikimedia.org/T108557) (owner: 10Dereckson) [00:06:36] PROBLEM - dhclient process on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:07:35] RECOVERY - salt-minion processes on mw1139 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:08:26] RECOVERY - dhclient process on mw1139 is OK: PROCS OK: 0 processes with command name dhclient [00:12:55] PROBLEM - Disk space on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:13:16] PROBLEM - salt-minion processes on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:14:06] PROBLEM - dhclient process on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:16:56] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2268905 (10Volans) a:05Volans>03None >>! In T111654#2266901, @faidon wrote: > To keep such a security feature effective and have this be a real protection against snooping, rather than... [00:18:16] 06Operations, 10RESTBase-Cassandra, 06Services: Cleanup Graphite Cassandra metrics - https://phabricator.wikimedia.org/T132771#2268907 (10Eevans) >>! In T132771#2267279, @faidon wrote: >>>! In T132771#2210653, @Eevans wrote: >>> We have around ~2 million (2.017.651) Cassandra-related metrics on Graphite. Thi... 
[00:18:34] 06Operations, 10RESTBase-Cassandra, 06Services, 10cassandra: Cleanup Graphite Cassandra metrics - https://phabricator.wikimedia.org/T132771#2268908 (10Eevans) [00:18:41] 06Operations, 10RESTBase-Cassandra, 06Services, 10cassandra: Cleanup Graphite Cassandra metrics - https://phabricator.wikimedia.org/T132771#2209932 (10Eevans) [00:23:35] RECOVERY - dhclient process on mw1139 is OK: PROCS OK: 0 processes with command name dhclient [00:23:47] RECOVERY - nutcracker process on mw1139 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [00:23:55] RECOVERY - DPKG on mw1139 is OK: All packages OK [00:23:55] RECOVERY - puppet last run on mw1139 is OK: OK: Puppet is currently enabled, last run 56 minutes ago with 0 failures [00:23:56] RECOVERY - Check size of conntrack table on mw1139 is OK: OK: nf_conntrack is 1 % full [00:24:16] RECOVERY - Disk space on mw1139 is OK: DISK OK [00:24:35] RECOVERY - HHVM processes on mw1139 is OK: PROCS OK: 6 processes with command name hhvm [00:24:35] RECOVERY - SSH on mw1139 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [00:24:36] RECOVERY - salt-minion processes on mw1139 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:24:46] RECOVERY - RAID on mw1139 is OK: OK: no RAID installed [00:24:47] RECOVERY - nutcracker port on mw1139 is OK: TCP OK - 0.000 second response time on port 11212 [00:24:47] RECOVERY - HHVM rendering on mw1139 is OK: HTTP OK: HTTP/1.1 200 OK - 72675 bytes in 0.101 second response time [00:25:15] RECOVERY - configured eth on mw1139 is OK: OK - interfaces up [00:25:36] RECOVERY - Apache HTTP on mw1139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 1.294 second response time [00:28:47] PROBLEM - HHVM rendering on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:28:48] 06Operations, 06Parsing-Team, 06Services, 03Mobile-Content-Service: ChangeProp / RESTBase / Parsoid outage 2016-05-05 - https://phabricator.wikimedia.org/T134537#2268916 (10GWicke) [00:29:37] PROBLEM - Apache HTTP on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:29:46] PROBLEM - DPKG on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:29:46] PROBLEM - puppet last run on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:29:57] PROBLEM - Check size of conntrack table on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:30:16] PROBLEM - Disk space on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:30:36] PROBLEM - HHVM processes on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:30:37] PROBLEM - SSH on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:30:37] PROBLEM - salt-minion processes on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:30:56] PROBLEM - RAID on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:30:56] PROBLEM - nutcracker port on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:31:16] 06Operations, 06Parsing-Team, 06Services, 03Mobile-Content-Service: ChangeProp / RESTBase / Parsoid outage 2016-05-05 - https://phabricator.wikimedia.org/T134537#2268933 (10GWicke) [00:31:53] PROBLEM - dhclient process on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:32:11] PROBLEM - nutcracker process on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[00:33:08] bblack, seems like graphs are not being cached - https://phabricator.wikimedia.org/T134542 - any thoughts? [00:34:02] PROBLEM - configured eth on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:38:11] RECOVERY - HHVM processes on mw1139 is OK: PROCS OK: 6 processes with command name hhvm [00:38:11] RECOVERY - salt-minion processes on mw1139 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:42:31] RECOVERY - dhclient process on mw1139 is OK: PROCS OK: 0 processes with command name dhclient [00:42:40] RECOVERY - nutcracker port on mw1139 is OK: TCP OK - 0.000 second response time on port 11212 [00:42:50] RECOVERY - nutcracker process on mw1139 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [00:43:31] RECOVERY - configured eth on mw1139 is OK: OK - interfaces up [00:43:31] RECOVERY - DPKG on mw1139 is OK: All packages OK [00:43:41] RECOVERY - Disk space on mw1139 is OK: DISK OK [00:49:11] PROBLEM - configured eth on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:49:18] 06Operations, 06Labs, 10Labs-Infrastructure: python-designateclient package version does not match between labtestweb2001 and silver - https://phabricator.wikimedia.org/T134543#2268964 (10Krenair) [00:49:20] PROBLEM - DPKG on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:49:31] PROBLEM - Disk space on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:49:40] PROBLEM - HHVM processes on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:49:41] PROBLEM - salt-minion processes on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:50:10] PROBLEM - dhclient process on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:50:21] PROBLEM - nutcracker port on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:50:31] PROBLEM - nutcracker process on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:55:11] 06Operations, 10Beta-Cluster-Infrastructure, 10Deployment-Systems, 13Patch-For-Review, 03Scap3: Automate the generation deployment keys (keyholder-managed ssh keys) - https://phabricator.wikimedia.org/T133211#2268982 (10mmodell) Even if @faidon isn't comfortable with running ssh_keygen on production pupp... 
[00:55:21] RECOVERY - SSH on mw1139 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [00:55:21] RECOVERY - salt-minion processes on mw1139 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:55:42] RECOVERY - dhclient process on mw1139 is OK: PROCS OK: 0 processes with command name dhclient [00:56:00] RECOVERY - nutcracker port on mw1139 is OK: TCP OK - 0.000 second response time on port 11212 [00:56:01] RECOVERY - RAID on mw1139 is OK: OK: no RAID installed [00:56:02] RECOVERY - nutcracker process on mw1139 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [00:56:21] RECOVERY - puppet last run on mw1139 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [00:56:31] RECOVERY - Check size of conntrack table on mw1139 is OK: OK: nf_conntrack is 0 % full [00:56:41] RECOVERY - configured eth on mw1139 is OK: OK - interfaces up [00:56:50] RECOVERY - DPKG on mw1139 is OK: All packages OK [00:57:10] RECOVERY - Disk space on mw1139 is OK: DISK OK [00:57:11] RECOVERY - HHVM processes on mw1139 is OK: PROCS OK: 6 processes with command name hhvm [01:01:21] PROBLEM - SSH on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:01:21] PROBLEM - salt-minion processes on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:01:54] PROBLEM - dhclient process on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:02:15] PROBLEM - RAID on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:02:15] PROBLEM - Disk space on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:02:23] PROBLEM - HHVM processes on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:02:34] PROBLEM - configured eth on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:03:03] PROBLEM - nutcracker port on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:03:04] PROBLEM - nutcracker process on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:03:15] PROBLEM - puppet last run on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:04:44] PROBLEM - Check size of conntrack table on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:05:03] PROBLEM - DPKG on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:06:59] 06Operations, 06Parsing-Team, 06Services, 03Mobile-Content-Service: ChangeProp / RESTBase / Parsoid outage 2016-05-05 - https://phabricator.wikimedia.org/T134537#2268989 (10bd808) Probably should be copied to a new page in https://wikitech.wikimedia.org/wiki/Incident_documentation as well unless you can co... 
[01:07:53] RECOVERY - dhclient process on mw1139 is OK: PROCS OK: 0 processes with command name dhclient [01:08:13] RECOVERY - nutcracker port on mw1139 is OK: TCP OK - 0.000 second response time on port 11212 [01:08:23] RECOVERY - nutcracker process on mw1139 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [01:08:24] RECOVERY - Check size of conntrack table on mw1139 is OK: OK: nf_conntrack is 0 % full [01:08:43] RECOVERY - DPKG on mw1139 is OK: All packages OK [01:08:55] RECOVERY - Disk space on mw1139 is OK: DISK OK [01:08:55] RECOVERY - RAID on mw1139 is OK: OK: no RAID installed [01:09:04] RECOVERY - HHVM processes on mw1139 is OK: PROCS OK: 6 processes with command name hhvm [01:09:14] RECOVERY - salt-minion processes on mw1139 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:09:14] RECOVERY - SSH on mw1139 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [01:09:23] RECOVERY - configured eth on mw1139 is OK: OK - interfaces up [01:41:03] Request from 90.180.83.194 via cp1051 cp1051, Varnish XID 2120918287 [01:41:06] Error: 503, Service Unavailable at Fri, 06 May 2016 01:40:49 GMT [01:41:10] (phabricator) [01:49:03] 06Operations, 06WMF-Legal, 07Privacy: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104#2269029 (10TerraCodes) >>! In T132104#2267669, @Dzahn wrote: > If we could actually use MediaWiki that would be lovely too. As you say, usually the skinning is the is... [02:04:10] 06Operations, 10RESTBase-Cassandra, 06Services, 10cassandra: Cleanup Graphite Cassandra metrics - https://phabricator.wikimedia.org/T132771#2269036 (10Eevans) a:03fgiunchedi [02:04:42] 06Operations, 10Education-Program-Dashboard, 10Traffic, 03Programs-and-Events-Dashboard-Sprint 2: Cache education dashboard pages - https://phabricator.wikimedia.org/T120509#2269037 (10BBlack) We don't cache labs services in production, and we don't currently (AFAIK) have any kind of cache_misc equivalent... [02:22:16] 06Operations, 10Education-Program-Dashboard, 10Traffic, 03Programs-and-Events-Dashboard-Sprint 2: Cache education dashboard pages - https://phabricator.wikimedia.org/T120509#2269061 (10awight) Wonderful, thanks for the helpful overview! I'm sure it will be fine to do an initial deployment without caching,... [02:28:46] 06Operations, 06WMF-Legal, 07Privacy: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104#2269062 (10Dzahn) @TerraCodes No, not really. WMF has hosted their own wordpress in the past for blog.wm.org, discussed it many times, it was high maintenance (regul... [02:36:14] (03CR) 10Dzahn: "can a - be at begining and end or do we want to make sure it's only in the middle ?" [puppet] - 10https://gerrit.wikimedia.org/r/287032 (https://phabricator.wikimedia.org/T134447) (owner: 10Dzahn) [02:39:36] 06Operations, 06Parsing-Team, 06Services, 03Mobile-Content-Service: ChangeProp / RESTBase / Parsoid outage 2016-05-05 - https://phabricator.wikimedia.org/T134537#2269069 (10GWicke) @bd808, it'll be easier for us to follow up on the actionables here. I created a pointer from the wiki, so it shows up as part... 
[02:39:54] RECOVERY - Apache HTTP on mw1139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.070 second response time [02:40:45] RECOVERY - HHVM rendering on mw1139 is OK: HTTP OK: HTTP/1.1 200 OK - 72158 bytes in 0.162 second response time [02:47:57] 06Operations, 06Discovery, 10Maps, 10Tilerator, and 2 others: Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776#2269070 (10BBlack) Yes, invalidating 1200 tiles/sec is ridiculous no matter how we do it. We definitely agree there. First, let's explore how you're accepting and... [02:55:03] (03CR) 10BBlack: "Probably best to not have special chars in the first byte, to be conservative? So maybe a regex like '^[A-Za-z0-9][-A-Za-z0-9_]*$' ?" [puppet] - 10https://gerrit.wikimedia.org/r/287032 (https://phabricator.wikimedia.org/T134447) (owner: 10Dzahn) [02:55:35] RECOVERY - puppet last run on mw1139 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [02:59:11] 06Operations, 06Labs, 06Release-Engineering-Team, 10wikitech.wikimedia.org: Rename specific account in LDAP, Wikitech, Gerrit and Phabricator - https://phabricator.wikimedia.org/T133968#2269087 (10lfschenone) [02:59:42] 06Operations, 10Graph, 10Graphoid, 10Traffic: Graph results are not being cached in Varnish - https://phabricator.wikimedia.org/T134542#2269088 (10BBlack) [03:09:41] 06Operations, 10Graph, 10Graphoid, 10Traffic: Graph results are not being cached in Varnish - https://phabricator.wikimedia.org/T134542#2269090 (10Yurik) @bblack, that explains it, thanks. I will update Graphoid to raise it to a few days on Monday. Also, the second reply is strange - instead of a miss, i... [03:13:36] 06Operations, 06WMF-Legal, 07Privacy: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104#2269104 (10Danny_B) At first, we're still missing the answer on question //How often is it (supposed to be) edited?// It would be great if somebody relevant (#wmf-leg... [03:22:00] (03PS5) 10Dzahn: acme-setup: only accept ASCII letters as unique cert ID [puppet] - 10https://gerrit.wikimedia.org/r/287032 (https://phabricator.wikimedia.org/T134447) [03:23:08] (03PS6) 10Dzahn: acme-setup: only accept ^[a-z0-9-_]+$' as unique cert ID [puppet] - 10https://gerrit.wikimedia.org/r/287032 (https://phabricator.wikimedia.org/T134447) [03:35:11] (03PS1) 10Papaul: DNS: Adding production DNS entries for maps200[1-4] Bug: T134406 [dns] - 10https://gerrit.wikimedia.org/r/287164 (https://phabricator.wikimedia.org/T134406) [03:41:29] 06Operations, 06Discovery, 10Maps, 10Tilerator, and 2 others: Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776#2269122 (10Yurik) >>! In T109776#2269070, @BBlack wrote: > Are you using daily changsets? hourly? They even publish per-minute changesets. We are using daily, but... 
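For context on the acme-setup review above: the two candidate patterns differ mainly in what they allow as the first byte of the cert ID. A minimal sketch of that difference (not the acme-setup code itself; the sample IDs are invented):

```
# Compare the cert-ID patterns from the review above. Illustration only,
# not the actual acme-setup validation code.
import re

bblack_re = re.compile(r'^[A-Za-z0-9][-A-Za-z0-9_]*$')  # conservative first byte
ps6_re = re.compile(r'^[a-z0-9-_]+$')                    # pattern from the PS6 commit message

for cert_id in ('w-wiki', 'unified_2016', '-leading-dash', '_leading_underscore'):
    print(cert_id,
          'bblack:', bool(bblack_re.match(cert_id)),
          'ps6:', bool(ps6_re.match(cert_id)))
```

Both patterns accept only ASCII letters, digits, '-' and '_'; the first one also allows uppercase but rejects IDs starting with '-' or '_', which is the "no special chars in the first byte" point from the review.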
[03:43:48] 06Operations, 10ops-codfw: rack/setup/deploy maps200[1-4] switch configuration - https://phabricator.wikimedia.org/T134549#2269123 (10Papaul) [03:48:49] 06Operations, 10ops-codfw: rack/setup/deploy maps200[1-4] - https://phabricator.wikimedia.org/T134406#2269141 (10Papaul) [04:03:48] (03PS1) 10Papaul: DHCP: Add MAC Address entries for maps200[1-4] Bug: T134406 [puppet] - 10https://gerrit.wikimedia.org/r/287165 (https://phabricator.wikimedia.org/T134406) [04:11:22] (03PS1) 10Papaul: Adding install params for maps200[1-4] Bug: T134406 [puppet] - 10https://gerrit.wikimedia.org/r/287167 (https://phabricator.wikimedia.org/T134406) [04:11:48] 06Operations, 10ops-codfw: rack/setup/deploy maps200[1-4] switch configuration - https://phabricator.wikimedia.org/T134549#2269145 (10Papaul) a:05Papaul>03RobH [04:14:46] 06Operations, 10ops-codfw: rack/setup/deploy maps200[1-4] - https://phabricator.wikimedia.org/T134406#2269148 (10Papaul) [04:27:55] heads up: I'll be rolling back the wmf23 branch from Wikipedias due to a performance regression. RelEng is on board. [04:29:13] (03PS1) 10Ori.livneh: Revert "Moving remaining wikis to 1.27.0-wmf.23" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287169 [04:29:29] ostriches: +1? [04:31:35] or thcipriani [04:32:02] * thcipriani looks [04:32:38] just a simple revert, but I'd appreciate a sign-off from someone in releng [04:32:51] (03CR) 10Thcipriani: [C: 031] Revert "Moving remaining wikis to 1.27.0-wmf.23" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287169 (owner: 10Ori.livneh) [04:33:02] thanks [04:33:03] indeed. [04:33:24] (03CR) 10Ori.livneh: [C: 032] Revert "Moving remaining wikis to 1.27.0-wmf.23" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287169 (owner: 10Ori.livneh) [04:33:34] (03CR) 10Dzahn: [C: 032] DNS: Adding production DNS entries for maps200[1-4] Bug: T134406 [dns] - 10https://gerrit.wikimedia.org/r/287164 (https://phabricator.wikimedia.org/T134406) (owner: 10Papaul) [04:33:48] (03Merged) 10jenkins-bot: Revert "Moving remaining wikis to 1.27.0-wmf.23" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287169 (owner: 10Ori.livneh) [04:35:26] (03PS2) 10Dzahn: DHCP: Add MAC Address entries for maps200[1-4] Bug: T134406 [puppet] - 10https://gerrit.wikimedia.org/r/287165 (https://phabricator.wikimedia.org/T134406) (owner: 10Papaul) [04:35:33] (03CR) 10Dzahn: [C: 032] DHCP: Add MAC Address entries for maps200[1-4] Bug: T134406 [puppet] - 10https://gerrit.wikimedia.org/r/287165 (https://phabricator.wikimedia.org/T134406) (owner: 10Papaul) [04:35:50] !log ori@tin rebuilt wikiversions.php and synchronized wikiversions files: Wikipedias back to wmf.22 due to page load performance regression [04:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:38:14] (03PS2) 10Dzahn: Adding install params for maps200[1-4] Bug: T134406 [puppet] - 10https://gerrit.wikimedia.org/r/287167 (https://phabricator.wikimedia.org/T134406) (owner: 10Papaul) [04:38:58] (03CR) 10Dzahn: [C: 032] Adding install params for maps200[1-4] Bug: T134406 [puppet] - 10https://gerrit.wikimedia.org/r/287167 (https://phabricator.wikimedia.org/T134406) (owner: 10Papaul) [04:42:21] 06Operations, 10ops-codfw: rack/setup/deploy maps200[1-4] - https://phabricator.wikimedia.org/T134406#2269179 (10Dzahn) 21:38 < grrrit-wm> (CR) Dzahn: [C: 2] DNS: Adding production DNS entries for maps200[1-4] Bug: T134406 [dns] - https://gerrit.wikimedia.org/r/287164 (https://phabricator.w... 
[04:51:43] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [05:05:32] 06Operations, 06Parsing-Team, 06Services, 03Mobile-Content-Service: Create functional cluster checks for all services (and have them page!) - https://phabricator.wikimedia.org/T134551#2269184 (10Joe) [05:05:40] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [05:05:58] !log Update cxserver to 155c2d4 [05:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:21:39] PROBLEM - Apache HTTP on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:21:49] 06Operations, 10Traffic: Support TLS chacha20-poly1305 AEAD ciphers - https://phabricator.wikimedia.org/T131908#2269203 (10BBlack) It passed the IESG! https://datatracker.ietf.org/doc/draft-ietf-tls-chacha20-poly1305/history/ Now just waiting on the official announcement and making it through the editor queu... [05:23:21] RECOVERY - Apache HTTP on mw1120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.044 second response time [05:26:20] 06Operations, 10Graph, 10Graphoid, 10Traffic: Graph results are not being cached in Varnish - https://phabricator.wikimedia.org/T134542#2269204 (10BBlack) I get normal misses and hits trying that second one incognito. pass+chfp behavior usually means Varnish refused to cache (or use existing cache) your r... [05:35:09] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/0/0: down - Peering: Equinix Chicago (SR 17915277) {#11374} [10Gbps DF]BR [05:36:38] 06Operations, 10Graph, 10Graphoid, 10Traffic: Graph results are not being cached in Varnish - https://phabricator.wikimedia.org/T134542#2269205 (10Yurik) That is very strange - all three graphs are on the same page, generated in a similar way using the same code, and only the middle one is consistently giv... 
[05:37:00] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [05:57:01] !log restarting elasticsearch server elastic1010.eqiad.wmnet (T110236) [05:57:02] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [05:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:15:51] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: Puppet has 1 failures [06:21:23] (03CR) 10Jcrespo: MariaDB: set $master true for codfw masters (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/287144 (https://phabricator.wikimedia.org/T134481) (owner: 10Volans) [06:30:39] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:01] PROBLEM - puppet last run on mw2136 is CRITICAL: CRITICAL: puppet fail [06:31:39] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:00] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:09] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:49] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: Puppet has 1 failures [06:37:10] PROBLEM - puppet last run on db2051 is CRITICAL: CRITICAL: Puppet has 1 failures [06:42:40] RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:00] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:09] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:57:30] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:01] RECOVERY - puppet last run on mw2136 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [06:58:39] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:40] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:05:07] RECOVERY - puppet last run on db2051 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:05:45] damn it [07:05:52] Your message to Engineering awaits moderator approval [07:05:55] because it was too large [07:07:47] (03PS3) 10Muehlenhoff: Amend imagemagick policy to also include the URL decoder [puppet] - 10https://gerrit.wikimedia.org/r/286790 [07:09:40] (03PS1) 10Jcrespo: Repool db1065 after maintenance; repool db1023 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287174 [07:11:05] (03CR) 10Jcrespo: [C: 032] Repool db1065 after maintenance; repool db1023 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287174 (owner: 10Jcrespo) [07:11:29] (03Merged) 10jenkins-bot: Repool db1065 after maintenance; repool db1023 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287174 (owner: 10Jcrespo) [07:12:45] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1065 after maintenance; repool db1023 with low weight (duration: 00m 34s) [07:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:13:29] 06Operations, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, 06Services, and 2 others: Package and test apertium 
for Jessie - https://phabricator.wikimedia.org/T107306#2269256 (10KartikMistry) For rebuilding packages, we need to rebuild each package manually. I'll start with one... [07:14:17] 06Operations, 10ContentTranslation-cxserver, 10MediaWiki-extensions-ContentTranslation, 06Services, and 2 others: Package and test apertium for Jessie - https://phabricator.wikimedia.org/T107306#2269260 (10KartikMistry) [07:15:43] 06Operations, 10Phabricator, 06Project-Admins, 06Triagers: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706#2269261 (10mmodell) >>! In T706#2268462, @Yurik wrote: > Please add me as I manage several projects (graphs, maps, tabular data, ...). Tha... [07:44:38] 06Operations, 10DBA: db1033 (old s7 master) needs backup and reimage - https://phabricator.wikimedia.org/T134555#2269287 (10jcrespo) [07:51:53] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2269304 (10elukey) Stats for today: ``` elukey@mc1008:~$ echo stats | nc localhost 11211 STAT pid 4762 STAT uptime 61112 STAT time 1462520416 STAT version 1.4.... [07:52:34] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 13Patch-For-Review: Spark yarn in client mode is never moved from ACCEPTED to RUNNING - https://phabricator.wikimedia.org/T134422#2269305 (10MoritzMuehlenhoff) [07:59:00] (03PS1) 10Muehlenhoff: Add dstrine to bastiononly, researchers, statistics-privatedata-users, statistics-users groups [puppet] - 10https://gerrit.wikimedia.org/r/287175 (https://phabricator.wikimedia.org/T133953) [08:00:45] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for Dstrine - https://phabricator.wikimedia.org/T133953#2250003 (10MoritzMuehlenhoff) Approved by manager, waiting period has passed. [08:05:14] !log restarting elasticsearch server elastic1011.eqiad.wmnet (T110236) [08:05:15] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [08:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:06:58] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add dstrine to bastiononly, researchers, statistics-privatedata-users, statistics-users groups [puppet] - 10https://gerrit.wikimedia.org/r/287175 (https://phabricator.wikimedia.org/T133953) (owner: 10Muehlenhoff) [08:13:39] !log delete blacklisted cassandra metrics for restbase meta tables T134016 [08:13:40] T134016: RESTBase Cassandra cluster: Increase instance count from 2 to 3 - https://phabricator.wikimedia.org/T134016 [08:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:15:14] 06Operations, 10DBA: db1033 (old s7 master) needs backup and reimage - https://phabricator.wikimedia.org/T134555#2269322 (10jcrespo) @faidon At some point you personally offered help due to my large workload. **This ticket is my responsability**, but I would like to ask you to perform this one ticket (once) fo... [08:15:23] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for Dstrine - https://phabricator.wikimedia.org/T133953#2269324 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff @DStrine I merged a patch to enable your access. Catch me on IRC or... 
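The memcached check elukey pasted above ("echo stats | nc localhost 11211") can also be done from Python, which is handy when tracking a few counters over time. A rough sketch assuming the default host/port shown in the log, with minimal error handling:

```
# Read memcached's "stats" output over a plain TCP socket, roughly what
# "echo stats | nc localhost 11211" does. Illustration only.
import socket

def memcached_stats(host='localhost', port=11211):
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(b'stats\r\n')
        buf = b''
        while not buf.endswith(b'END\r\n'):
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
    stats = {}
    for line in buf.decode().splitlines():
        if line.startswith('STAT '):
            _, key, value = line.split(' ', 2)
            stats[key] = value
    return stats

print(memcached_stats().get('uptime'))
```

The stats block is terminated by an "END" line, which is what the read loop waits for before parsing.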
[08:15:42] 06Operations, 10DBA: db1033 (old s7 master) needs backup and reimage - https://phabricator.wikimedia.org/T134555#2269328 (10jcrespo) a:05jcrespo>03None [08:16:17] 06Operations, 10DBA: db1033 (old s7 master) needs backup and reimage - https://phabricator.wikimedia.org/T134555#2269287 (10jcrespo) [08:22:32] 06Operations, 10Ops-Access-Requests, 10Analytics, 10ContentTranslation-Analytics, 10MediaWiki-extensions-ContentTranslation: Add amire80 to analytics-privatedata-users group - https://phabricator.wikimedia.org/T122524#2269330 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [08:23:35] (03PS1) 10Ori.livneh: Remove duplicate lru_crawler option from mc[12]009 [puppet] - 10https://gerrit.wikimedia.org/r/287178 [08:24:06] 06Operations, 10Ops-Access-Requests, 10Analytics, 10ContentTranslation-Analytics, 10MediaWiki-extensions-ContentTranslation: Add amire80 to analytics-privatedata-users group - https://phabricator.wikimedia.org/T122524#1906445 (10MoritzMuehlenhoff) @Arrbee : Runa, you're Amir's manager, right? Please conf... [08:28:42] (03CR) 10Filippo Giunchedi: [C: 04-1] "generally ok, python nits" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/287121 (https://phabricator.wikimedia.org/T111064) (owner: 10Alex Monk) [08:29:31] (03PS1) 10Muehlenhoff: Add amire80 to analytics-privatedata-users group [puppet] - 10https://gerrit.wikimedia.org/r/287179 (https://phabricator.wikimedia.org/T122524) [08:30:39] (03CR) 10Muehlenhoff: [C: 04-2] "Don't merge yet, needs manager approval and the comment waiting period needs to pass." [puppet] - 10https://gerrit.wikimedia.org/r/287179 (https://phabricator.wikimedia.org/T122524) (owner: 10Muehlenhoff) [08:33:53] (03PS1) 10Filippo Giunchedi: cassandra: add restbase2009-b [puppet] - 10https://gerrit.wikimedia.org/r/287180 (https://phabricator.wikimedia.org/T132976) [08:35:21] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase2009-b [puppet] - 10https://gerrit.wikimedia.org/r/287180 (https://phabricator.wikimedia.org/T132976) (owner: 10Filippo Giunchedi) [08:36:49] !log restarting elasticsearch server elastic1012.eqiad.wmnet (T110236) [08:36:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:37:27] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [08:38:28] 06Operations, 10ops-eqiad, 06Analytics-Kanban: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2269348 (10JAllemandou) @ottomata: TL;DR: We already have replication factor of 3 :) Details: Double checked on cassandra-aqs: every keyspace we use has `replication = {'class': 'NetworkT... [08:41:19] 06Operations, 10Ops-Access-Requests, 10Analytics, 10ContentTranslation-Analytics, and 2 others: Add amire80 to analytics-privatedata-users group - https://phabricator.wikimedia.org/T122524#2269349 (10Arrbee) Hi @MoritzMuehlenhoff , I would like to confirm that this request is approved for @Amire80 . Thanks. 
[08:42:19] (03PS1) 10Muehlenhoff: Add fonts-smc (Malayalam) to image/video scalers [puppet] - 10https://gerrit.wikimedia.org/r/287181 (https://phabricator.wikimedia.org/T33950) [08:43:08] (03CR) 10Volans: "See my inline comments" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/287144 (https://phabricator.wikimedia.org/T134481) (owner: 10Volans) [08:43:10] 06Operations, 10Ops-Access-Requests, 10Analytics, 10ContentTranslation-Analytics, and 2 others: Add amire80 to analytics-privatedata-users group - https://phabricator.wikimedia.org/T122524#2269353 (10MoritzMuehlenhoff) Thanks, I'll merge this on Monday. [08:48:10] 06Operations, 10Traffic, 10Wikidata: Varnish seems to sometimes mangle uncompressed API results - https://phabricator.wikimedia.org/T133866#2269379 (10hoo) Another report about this (probably) on Wikitech: https://lists.wikimedia.org/pipermail/wikitech-l/2016-May/085526.html [08:49:36] 06Operations, 10Traffic: Something in WMF infrastructure corrupts responses with certain lengths - https://phabricator.wikimedia.org/T132159#2269398 (10hoo) [08:49:39] 06Operations, 10Traffic, 10Wikidata: Varnish seems to sometimes mangle uncompressed API results - https://phabricator.wikimedia.org/T133866#2269400 (10hoo) [08:57:37] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2269424 (10ori) We are seeing some pretty pathological behavior. Both mc1008 and mc1009 have the same slab growth factor, but mc1008 has 182 slabs (with a good... [09:06:19] PROBLEM - cassandra-b CQL 10.192.48.55:9042 on restbase2009 is CRITICAL: Connection refused [09:06:49] ACKNOWLEDGEMENT - cassandra-b CQL 10.192.48.55:9042 on restbase2009 is CRITICAL: Connection refused Filippo Giunchedi bootstrapping [09:12:18] 06Operations, 10RESTBase-Cassandra, 06Services, 10cassandra: Cleanup Graphite Cassandra metrics - https://phabricator.wikimedia.org/T132771#2269425 (10fgiunchedi) >>! In T132771#2268907, @Eevans wrote: > I added support for whitelists (to compliment the existing support for blacklisting) to cassandra-metri... 
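As background for the slab observation above: memcached's -f growth factor sets the geometric step between chunk-size classes, so a small factor (e.g. 1.05) produces many more, finer-grained classes than the stock 1.25. A rough illustration; the base chunk size and the 1 MB cap are assumptions for the sketch, not values read from mc1008 or mc1009:

```
# Illustrate how the growth factor (-f) shapes the ladder of chunk-size
# classes. Base size and cap are assumed values for the example only.
def chunk_sizes(growth_factor, base=96, max_chunk=1024 * 1024):
    sizes = []
    size = base
    while size <= max_chunk:
        sizes.append(int(size))
        size *= growth_factor
    return sizes

for factor in (1.25, 1.05):
    classes = chunk_sizes(factor)
    print('factor %.2f -> %d size classes, largest %d bytes'
          % (factor, len(classes), classes[-1]))
```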
[09:17:32] !log restarted hhvm on mw1148 [09:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:18:58] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 1.229 second response time [09:19:28] RECOVERY - HHVM rendering on mw1148 is OK: HTTP OK: HTTP/1.1 200 OK - 72182 bytes in 0.373 second response time [09:26:02] (03PS1) 10Jcrespo: Depool db1070 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287183 (https://phabricator.wikimedia.org/T134360) [10:03:16] PROBLEM - NTP on lvs3003 is CRITICAL: NTP CRITICAL: No response from NTP server [10:07:35] PROBLEM - NTP on lvs1001 is CRITICAL: NTP CRITICAL: No response from NTP server [10:10:44] looking into the NTP warnings [10:13:16] caused by T126733 [10:13:16] T126733: ntp restart sometimes unrealiable - https://phabricator.wikimedia.org/T126733 [10:15:16] RECOVERY - NTP on lvs1001 is OK: NTP OK: Offset -0.002312779427 secs [10:16:46] RECOVERY - NTP on lvs3003 is OK: NTP OK: Offset -0.000323176384 secs [10:32:39] (03CR) 10Alexandros Kosiaris: [C: 031] ircserver: move ircd.conf to public repo [puppet] - 10https://gerrit.wikimedia.org/r/286783 (https://phabricator.wikimedia.org/T134271) (owner: 10Dzahn) [10:50:05] (03PS1) 10Filippo Giunchedi: symlink /.well-known/apple-app-site-association to /apple-app-site-association [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287190 (https://phabricator.wikimedia.org/T130647) [10:59:19] !log restarting slapd on pollux to pick up openssl update [10:59:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:02:30] !log Overwrote property suggester data with data from the 20160215 dump (T132839) [11:02:31] T132839: Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [11:02:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:09:53] PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: puppet fail [11:10:01] !log Reverted the property suggester data to data from the 20160411 dump (done testing T132839) [11:10:01] T132839: Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [11:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:15:01] !log restarting slapd on dubnium to pick up openssl update [11:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:29:23] PROBLEM - NTP on es2018 is CRITICAL: NTP CRITICAL: No response from NTP server [11:30:12] PROBLEM - NTP on es2011 is CRITICAL: NTP CRITICAL: No response from NTP server [11:30:22] PROBLEM - NTP on es2016 is CRITICAL: NTP CRITICAL: No response from NTP server [11:31:42] PROBLEM - NTP on es2012 is CRITICAL: NTP CRITICAL: No response from NTP server [11:32:42] PROBLEM - NTP on es2013 is CRITICAL: NTP CRITICAL: No response from NTP server [11:33:33] RECOVERY - NTP on es2018 is OK: NTP OK: Offset 0.0005171298981 secs [11:34:13] RECOVERY - NTP on es2011 is OK: NTP OK: Offset -0.0003994703293 secs [11:35:42] RECOVERY - NTP on es2012 is OK: NTP OK: Offset 0.001216769218 secs [11:36:22] RECOVERY - puppet last run on cp3037 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [11:36:23] RECOVERY - NTP on es2016 is OK: NTP OK: Offset -0.003111839294 secs [11:37:42] PROBLEM - puppet last run on mw2074 is CRITICAL: CRITICAL: Puppet has 1 failures [11:38:33] RECOVERY 
- NTP on es2013 is OK: NTP OK: Offset -0.0002655982971 secs [11:42:02] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 645 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5880489 keys - replication_delay is 645 [11:47:48] !log restarting aqs on aqs100[123] for security upgrades. [11:47:53] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5862062 keys - replication_delay is 0 [11:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:52:25] !log rolling restart of restbase in codfw for openssl update [11:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:58:29] (03PS4) 10Muehlenhoff: Amend imagemagick policy to also include the URL decoder [puppet] - 10https://gerrit.wikimedia.org/r/286790 [11:59:59] (03CR) 10Mobrovac: Text VCL: RB ?redirect=false optimization (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/287104 (https://phabricator.wikimedia.org/T134464) (owner: 10BBlack) [12:00:03] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [12:02:31] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5862441 keys - replication_delay is 0 [12:03:33] (03CR) 10Muehlenhoff: [C: 032 V: 032] Amend imagemagick policy to also include the URL decoder [puppet] - 10https://gerrit.wikimedia.org/r/286790 (owner: 10Muehlenhoff) [12:03:41] RECOVERY - puppet last run on mw2074 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [12:04:48] 06Operations, 06WMF-Legal, 07Privacy: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104#2269668 (10Peachey88) >>! In T132104#2267669, @Dzahn wrote: > If we could actually use MediaWiki that would be lovely too. As you say, usually the skinning is the iss... [12:19:13] (03CR) 10Faidon Liambotis: [C: 031] ircserver: move ircd.conf to public repo [puppet] - 10https://gerrit.wikimedia.org/r/286783 (https://phabricator.wikimedia.org/T134271) (owner: 10Dzahn) [12:22:01] PROBLEM - Apache HTTP on mw1226 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.013 second response time [12:23:01] PROBLEM - HHVM rendering on mw1226 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50404 bytes in 0.009 second response time [12:28:49] !log restbase rolling restart in eqiad for openssl update [12:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:30:30] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: puppet fail [12:30:44] (03CR) 10Alexandros Kosiaris: "@Dzahn, that's actually a good question. 
So, technically, url-downloader differs from webproxy in that it provides no caching (and has som" [puppet] - 10https://gerrit.wikimedia.org/r/287077 (owner: 10Alexandros Kosiaris) [12:40:00] RECOVERY - Apache HTTP on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.105 second response time [12:40:06] !log restarted hhvm on mw1226 (hhvm-dump-debug output available) [12:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:41:01] RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 72169 bytes in 0.239 second response time [12:43:35] 06Operations, 10Deployment-Systems, 06Release-Engineering-Team, 06Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#2269737 (10akosiaris) Seems like we have different meanings for #Blocked-on-operations. For me the tag means "somehow operations i... [12:53:07] 06Operations, 10ops-eqiad, 13Patch-For-Review: Decommission broken db1058 - https://phabricator.wikimedia.org/T134360#2269749 (10Cmjohnson) @Southparkfan We have a pretty strict way of removing servers. It is all documented here https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reclaim_or_Decommission.... [12:56:27] 06Operations, 10Graph, 10Graphoid, 10Traffic: Graph results are not being cached in Varnish - https://phabricator.wikimedia.org/T134542#2269755 (10Yurik) [12:58:20] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:02:14] (03CR) 10Jcrespo: [C: 032] Depool db1070 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287183 (https://phabricator.wikimedia.org/T134360) (owner: 10Jcrespo) [13:06:05] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1070 for maintenance (duration: 00m 29s) [13:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:11:32] (03PS1) 10Jcrespo: Reimage db1070 as jessie [puppet] - 10https://gerrit.wikimedia.org/r/287206 [13:12:06] (03PS2) 10Jcrespo: Reimage db1070 as jessie [puppet] - 10https://gerrit.wikimedia.org/r/287206 [13:16:34] mutante: https://phabricator.wikimedia.org/P3001#13591 [13:16:55] (03CR) 10Jcrespo: [C: 032] Reimage db1070 as jessie [puppet] - 10https://gerrit.wikimedia.org/r/287206 (owner: 10Jcrespo) [13:17:01] 06Operations, 10ops-eqiad, 13Patch-For-Review: Decommission broken db1058 - https://phabricator.wikimedia.org/T134360#2269809 (10Southparkfan) @Cmjohnson yeah, perhaps I have been a bit too fast by already doing the DNS part (despite that's the only thing I can do it seems) :-) Anyway, ops know more than me... [13:18:41] 06Operations, 10ops-ulsfo: cp4016: bad power supply - https://phabricator.wikimedia.org/T134526#2269813 (10Cmjohnson) Dear Johnson, Christopher, Your dispatch shipped on 5/5/2016 6:34:04 PM What's Next? If you need to make any changes to the dispatch contact information, please visit our Support Center or... [13:19:41] thcipriani|afk: sadly I wasn't able to get to scap this week, any urgent change that should be deployed? [13:20:30] 06Operations, 10ops-ulsfo: cp4016: bad power supply - https://phabricator.wikimedia.org/T134526#2269831 (10Cmjohnson) a:05Cmjohnson>03RobH @robh please receive in and make note of return tracking number for the service you use....USPS is top and FEDEX is bottom and assign back to me so I can update the po... [13:23:28] godog: nothing super urgent that needs to get out. 
The bump from 3.1 to 3.2 is due to moving to scap subcommands (e.g. scap sync vs. scap) which is probably the biggest change in the new version. [13:24:17] thcipriani|afk: ack, thanks! I'll poke at it on monday then [13:24:27] thanks! [13:25:58] yw! [13:26:26] !log reimaging db1070 [13:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:49:22] (03PS2) 10Filippo Giunchedi: monitoring: report reference name on uncommitted changes [puppet] - 10https://gerrit.wikimedia.org/r/285924 [13:49:28] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] monitoring: report reference name on uncommitted changes [puppet] - 10https://gerrit.wikimedia.org/r/285924 (owner: 10Filippo Giunchedi) [13:49:45] 06Operations, 06Labs, 10Labs-Infrastructure: python-designateclient package version does not match between labtestweb2001 and silver - https://phabricator.wikimedia.org/T134543#2269922 (10Andrew) 05Open>03Invalid Horizon is designed to be backwards-compatible with different OpenStack API versions, and we... [13:51:08] (03PS5) 10Rush: Increase the cache size for the Labs dns recursor [puppet] - 10https://gerrit.wikimedia.org/r/286897 (https://phabricator.wikimedia.org/T124680) (owner: 10Andrew Bogott) [13:53:06] !log restarting elasticsearch server elastic1013.eqiad.wmnet (T110236) [13:54:11] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [13:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:57:02] (03PS1) 10Ottomata: Add deployment-kafka03 to list of analytics kafka brokers in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/287215 (https://phabricator.wikimedia.org/T121562) [13:58:21] (03CR) 10Ottomata: [C: 032] Add deployment-kafka03 to list of analytics kafka brokers in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/287215 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [14:01:36] heya urandom, yt? [14:04:52] (03CR) 10BBlack: Text VCL: RB ?redirect=false optimization (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/287104 (https://phabricator.wikimedia.org/T134464) (owner: 10BBlack) [14:10:21] 06Operations, 10Parsoid, 10RESTBase, 06Services-next, and 4 others: Make RB ?redirect=false cache-efficient - https://phabricator.wikimedia.org/T134464#2269992 (10BBlack) So, from https://gerrit.wikimedia.org/r/287104 code review comments: apparently there's a second case here not discussed in T118548 ? "... [14:15:38] (03PS6) 10Rush: Increase the cache size for the Labs dns recursor [puppet] - 10https://gerrit.wikimedia.org/r/286897 (https://phabricator.wikimedia.org/T124680) (owner: 10Andrew Bogott) [14:16:20] (03CR) 10Rush: [C: 032 V: 032] Increase the cache size for the Labs dns recursor [puppet] - 10https://gerrit.wikimedia.org/r/286897 (https://phabricator.wikimedia.org/T124680) (owner: 10Andrew Bogott) [14:20:48] 06Operations, 10Parsoid, 10RESTBase, 06Services-next, and 4 others: Make RB ?redirect=false cache-efficient - https://phabricator.wikimedia.org/T134464#2270033 (10mobrovac) >>! In T134464#2269992, @BBlack wrote: > So, from https://gerrit.wikimedia.org/r/287104 code review comments: apparently there's a sec... 
[14:22:26] (03PS6) 10Filippo Giunchedi: prometheus: add node_exporter support [puppet] - 10https://gerrit.wikimedia.org/r/276243 (https://phabricator.wikimedia.org/T92813) [14:23:50] (03CR) 10jenkins-bot: [V: 04-1] prometheus: add node_exporter support [puppet] - 10https://gerrit.wikimedia.org/r/276243 (https://phabricator.wikimedia.org/T92813) (owner: 10Filippo Giunchedi) [14:26:31] PROBLEM - HTTPS-toolserver on www.toolserver.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection timed out [14:26:42] (03PS7) 10Filippo Giunchedi: prometheus: add node_exporter support [puppet] - 10https://gerrit.wikimedia.org/r/276243 (https://phabricator.wikimedia.org/T92813) [14:29:32] moritzm: could that www.toolserver.org ssl issue above be from updates and needed restart? I think that's a landinge page holdover from old toolserver on a VM in labs [14:30:42] chasemp: if it's running on a labs instance, probably not. I didn't update any of these [14:30:48] kk [14:30:51] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 656 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5869742 keys - replication_delay is 656 [14:31:52] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1058.eqiad.wmnet:3306 - retry-time: 60 retries: 86400 message: Cant connect to MySQL server on db1058.eqiad.wmnet (111 Connection refused) [14:31:53] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 195100.86 seconds [14:32:12] (03PS2) 10Ottomata: Set analytics kafka broker info for labs deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287106 (https://phabricator.wikimedia.org/T121562) [14:33:08] (03PS3) 10Ottomata: Set analytics kafka broker info for labs deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287106 (https://phabricator.wikimedia.org/T121562) [14:35:33] !log demon@tin Synchronized php-1.27.0-wmf.22/extensions/CentralAuth: Backporting T134246 (duration: 00m 38s) [14:35:34] T134246: Changing the email addresses sends both emails to the old address, none to the new address - https://phabricator.wikimedia.org/T134246 [14:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:35:40] anomie: Backported to wmf.22 ^^ [14:35:47] ostriches: Thanks! [14:36:11] np, thanks for spotting. [14:37:09] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
[14:41:28] PROBLEM - Host db1058 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:58] !log restarting elasticsearch server elastic1014.eqiad.wmnet (T110236) [14:45:59] T110236: Use unicast instead of multicast for node communication - https://phabricator.wikimedia.org/T110236 [14:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:46:48] !log restarting exim on MX servers to pick up openssl update [14:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:47:30] (03CR) 10DCausse: [C: 031] Set analytics kafka broker info for labs deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287106 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [14:48:46] chasemp: FYI I prodded strontium to update its git clone, it failed on puppet-merge [14:49:03] godog: ahhhh thank you [14:49:10] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [14:49:44] (03CR) 10Ottomata: [C: 032] Set analytics kafka broker info for labs deployment-prep [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287106 (https://phabricator.wikimedia.org/T121562) (owner: 10Ottomata) [14:49:44] np, there's https://phabricator.wikimedia.org/T128895 open for it but haven't found the time to poke more [14:52:20] (03PS1) 10Andrew Bogott: Fixes to the 'makedomain' script. [puppet] - 10https://gerrit.wikimedia.org/r/287223 [14:56:25] 06Operations, 10Analytics, 10Traffic, 07Privacy: Connect Hadoop records of the same request coming via different channels - https://phabricator.wikimedia.org/T113817#2270134 (10Nuria) Before adding any more data pieces (and agreed with @BBlack that this ticket needs more clarification) I would like to make... [14:56:44] (03PS1) 10Jcrespo: Retire db1058 from the service group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287224 (https://phabricator.wikimedia.org/T134360) [14:56:55] !log dcausse@tin Synchronized wmf-config/LabsServices.php: Set analytics kafka broker info for labs deployment-prep (duration: 00m 33s) [14:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:15:28] 06Operations, 10hardware-requests: new labstore hardware for eqiad - https://phabricator.wikimedia.org/T126089#2270199 (10MoritzMuehlenhoff) [15:15:30] 06Operations, 10ops-eqiad, 06DC-Ops: testing: r430 server / h800 controller / md1200 shelf - https://phabricator.wikimedia.org/T127490#2270196 (10MoritzMuehlenhoff) 05Resolved>03Open But the server is still up (and in puppet/salt)? root@neodymium:~# salt wmf* cmd.run 'uptime' wmf4727-test.eqiad.wmnet:... [15:21:20] 06Operations, 10ops-eqiad, 06Analytics-Kanban: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2270212 (10Ottomata) Ah ok cool. @Cmjohnson we are good to go on these then! 
[15:36:13] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [15:37:03] 06Operations, 06Analytics-Kanban, 10DNS, 10Traffic, 13Patch-For-Review: Create analytics.wikimedia.org - https://phabricator.wikimedia.org/T132407#2270254 (10Nuria) 05Open>03Resolved [15:39:03] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1698 bytes in 0.193 second response time [15:43:01] 06Operations, 10ops-eqiad, 06Analytics-Kanban: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2270265 (10mark) [15:43:03] 06Operations, 10Analytics, 10hardware-requests, 13Patch-For-Review: eqiad: (3) AQS replacement nodes - https://phabricator.wikimedia.org/T124947#2270266 (10mark) [15:44:53] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1691 bytes in 0.175 second response time [15:45:24] (03CR) 10Alex Monk: [C: 031] Fixes to the 'makedomain' script. [puppet] - 10https://gerrit.wikimedia.org/r/287223 (owner: 10Andrew Bogott) [15:48:27] 06Operations, 10ops-eqiad, 06Analytics-Kanban: rack/setup/deploy aqs100[456] - https://phabricator.wikimedia.org/T133785#2270278 (10Cmjohnson) @ottomata: yes...just need to add the dhcpd and partman but feel free if you have time [15:49:32] (03CR) 10Filippo Giunchedi: prometheus: add node_exporter support (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/276243 (https://phabricator.wikimedia.org/T92813) (owner: 10Filippo Giunchedi) [15:50:51] (03CR) 10Andrew Bogott: [C: 032] Fixes to the 'makedomain' script. [puppet] - 10https://gerrit.wikimedia.org/r/287223 (owner: 10Andrew Bogott) [15:55:03] (03PS1) 10Rush: labs pdns updates [puppet] - 10https://gerrit.wikimedia.org/r/287233 [15:59:23] (03CR) 10Alex Monk: [C: 031] symlink /.well-known/apple-app-site-association to /apple-app-site-association [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287190 (https://phabricator.wikimedia.org/T130647) (owner: 10Filippo Giunchedi) [16:00:50] 06Operations, 06Parsing-Team, 06Services, 03Mobile-Content-Service: ChangeProp / RESTBase / Parsoid outage 2016-05-05 - https://phabricator.wikimedia.org/T134537#2270298 (10GWicke) [16:12:02] (03PS7) 10Alex Monk: [WIP] Diamond collector for nagios plugin return codes [puppet] - 10https://gerrit.wikimedia.org/r/287121 (https://phabricator.wikimedia.org/T111064) [16:14:03] 06Operations, 13Patch-For-Review, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2270343 (10elukey) [16:14:05] 06Operations, 06Analytics-Kanban, 13Patch-For-Review: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2270342 (10elukey) 05Open>03Resolved [16:15:41] (03Abandoned) 10Jcrespo: [WIP] Script to generate openssh TLS keys for mysql replication [software] - 10https://gerrit.wikimedia.org/r/247542 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [16:19:09] (03CR) 10Filippo Giunchedi: prometheus: add node_exporter support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/276243 (https://phabricator.wikimedia.org/T92813) (owner: 10Filippo Giunchedi) [16:19:37] (03PS1) 10Andrew Bogott: Remove labcontrol2001 hiera def [puppet] - 10https://gerrit.wikimedia.org/r/287234 [16:19:39] (03PS1) 10Andrew Bogott: Remove ldap/dns services from labcontrol1001 and labcontrol1002 [puppet] - 10https://gerrit.wikimedia.org/r/287235 (https://phabricator.wikimedia.org/T126758) 
[16:19:42] (03PS1) 10Andrew Bogott: Purge labs dns/ldap code [puppet] - 10https://gerrit.wikimedia.org/r/287236 (https://phabricator.wikimedia.org/T126758) [16:21:08] (03Abandoned) 10Elukey: Configure mc1009 with the latest memcached version as performance test. [puppet] - 10https://gerrit.wikimedia.org/r/287058 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [16:21:59] (03PS1) 10Elukey: Restore basic memcached settings to mc1009 as part of a performance test. [puppet] - 10https://gerrit.wikimedia.org/r/287237 (https://phabricator.wikimedia.org/T129963) [16:24:49] (03PS8) 10Filippo Giunchedi: prometheus: add node_exporter support [puppet] - 10https://gerrit.wikimedia.org/r/276243 (https://phabricator.wikimedia.org/T92813) [16:25:03] 06Operations, 06Performance-Team, 13Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2270365 (10elukey) @ori, @Joe: I posted a code review to make mc1009 running with only slab_reassign (keeping the if/else structure so we'll be able to add easi... [16:25:57] (03PS1) 10Andrew Bogott: Remove dns entries for the old ldap/dns servers [dns] - 10https://gerrit.wikimedia.org/r/287238 (https://phabricator.wikimedia.org/T126758) [16:26:29] (03PS2) 10Andrew Bogott: Remove dns entries for the old ldap/dns servers [dns] - 10https://gerrit.wikimedia.org/r/287238 (https://phabricator.wikimedia.org/T126758) [16:26:53] (03CR) 10Andrew Bogott: [C: 032] Remove labcontrol2001 hiera def [puppet] - 10https://gerrit.wikimedia.org/r/287234 (owner: 10Andrew Bogott) [16:27:54] (03PS3) 10Alex Monk: Add basic contact form for stewards [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225509 (https://phabricator.wikimedia.org/T98625) [16:29:16] (03PS8) 10Alex Monk: Diamond collector for nagios plugin return codes [puppet] - 10https://gerrit.wikimedia.org/r/287121 (https://phabricator.wikimedia.org/T111064) [16:29:23] (03PS9) 10Alex Monk: Diamond collector for nagios plugin return codes [puppet] - 10https://gerrit.wikimedia.org/r/287121 (https://phabricator.wikimedia.org/T111064) [16:42:28] (03PS3) 10Andrew Bogott: Remove dns entries for the old ldap/dns servers [dns] - 10https://gerrit.wikimedia.org/r/287238 (https://phabricator.wikimedia.org/T126758) [16:42:30] (03PS1) 10Andrew Bogott: Removed the transitional labs-ns2 and labs-ns3 definitions. [dns] - 10https://gerrit.wikimedia.org/r/287245 (https://phabricator.wikimedia.org/T126758) [16:51:57] 06Operations, 06Parsing-Team, 06Services, 03Mobile-Content-Service: ChangeProp / RESTBase / Parsoid outage 2016-05-05 - https://phabricator.wikimedia.org/T134537#2270455 (10GWicke) [16:55:05] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, and 2 others: rack new mw log host - sinistra - https://phabricator.wikimedia.org/T128796#2270462 (10Papaul) [16:55:07] 06Operations, 10ops-codfw: sinistra - RAID failure - https://phabricator.wikimedia.org/T134187#2270458 (10Papaul) 05Open>03Resolved a:05Papaul>03Dzahn @Dzahn Drive replacement complete. [16:58:10] 06Operations, 10ops-codfw: rack/setup/deploy maps200[1-4] - https://phabricator.wikimedia.org/T134406#2270466 (10RobH) [16:58:12] 06Operations, 10ops-codfw: rack/setup/deploy maps200[1-4] switch configuration - https://phabricator.wikimedia.org/T134549#2270464 (10RobH) 05Open>03Resolved All switch port descriptions set to hostnames, enabled, and added to the private vlan for each row. [16:58:34] papaul: ^ switch ports done! 
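
The mc1009 change above re-enables only memcached's slab_reassign option as part of a performance test. One way to confirm what options a running memcached instance actually has in effect is to ask it for `stats settings` over its plain-text protocol; the sketch below does exactly that. The host/port are placeholders (mc1009 itself is only reachable inside production), and this is an illustration of the protocol, not part of the change under review.

```python
#!/usr/bin/env python
"""Hypothetical helper: dump memcached 'stats settings' and highlight the
slab_reassign / slab_automove options. Host and port are placeholders."""
import socket


def stats_settings(host, port=11211, timeout=5.0):
    sock = socket.create_connection((host, port), timeout=timeout)
    try:
        sock.sendall(b"stats settings\r\n")
        buf = b""
        # Read until memcached terminates the stats dump with END.
        while not (buf.endswith(b"END\r\n") or buf.endswith(b"ERROR\r\n")):
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
    finally:
        sock.close()
    settings = {}
    for line in buf.decode("ascii", "replace").splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "STAT":
            settings[parts[1]] = parts[2]
    return settings


if __name__ == "__main__":
    s = stats_settings("127.0.0.1")  # placeholder host, not mc1009
    for key in ("slab_reassign", "slab_automove"):
        print("%s = %s" % (key, s.get(key, "<not reported>")))
```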
[16:59:19] (03PS1) 10Alex Monk: udpmxircecho: use a config file [puppet] - 10https://gerrit.wikimedia.org/r/287246 [16:59:21] (03PS1) 10Alex Monk: udpmxircecho: Move from template to file [puppet] - 10https://gerrit.wikimedia.org/r/287247 [16:59:32] robh: thanks rob [16:59:38] welcome =] [17:00:35] (03CR) 10jenkins-bot: [V: 04-1] udpmxircecho: Move from template to file [puppet] - 10https://gerrit.wikimedia.org/r/287247 (owner: 10Alex Monk) [17:01:50] (03PS2) 10Alex Monk: udpmxircecho: use a config file [puppet] - 10https://gerrit.wikimedia.org/r/287246 [17:02:28] (03PS2) 10Alex Monk: udpmxircecho: Move from template to file [puppet] - 10https://gerrit.wikimedia.org/r/287247 [17:10:31] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5871600 keys - replication_delay is 0 [17:15:20] PROBLEM - puppet last run on cp2021 is CRITICAL: CRITICAL: Puppet has 1 failures [17:27:20] (03CR) 10Alex Monk: "May 06 17:24:01 deployment-ircd ircd[4966]: ERROR: No server name specified in serverinfo block." [puppet] - 10https://gerrit.wikimedia.org/r/286783 (https://phabricator.wikimedia.org/T134271) (owner: 10Dzahn) [17:28:30] /query mutante [17:28:33] oops [17:33:38] (03CR) 10Alex Monk: "Adding this block to the bottom of modules/secret/secrets/mw_rc_irc/auth.conf (in labs/private.git) did the trick:" [puppet] - 10https://gerrit.wikimedia.org/r/286783 (https://phabricator.wikimedia.org/T134271) (owner: 10Dzahn) [17:42:59] RECOVERY - puppet last run on cp2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:51:50] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 708 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5876018 keys - replication_delay is 708 [17:53:04] papaul: sinistra's 4th disk isnt showing [17:53:11] did it power up and show a green led? [17:53:51] I ask since it isn't showing in /dev/ or in fdisk [17:53:59] but you replaced disk (trying to avoid rebooting) [17:59:49] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [18:02:40] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:02:40] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:07:50] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
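
The two udpmxircecho patches above move the bot's settings out of a Puppet-templated script and into a standalone config file. A minimal sketch of what the config-loading side could look like, assuming a JSON file; the path, key names, and defaults here are invented for illustration and may not match the actual patch.

```python
#!/usr/bin/env python
"""Sketch of loading udpmxircecho-style settings from a config file rather
than baking them into the script via a Puppet template. The file path,
format (JSON) and key names below are assumptions, not the real change."""
import json

DEFAULTS = {
    "irc_server": "localhost",   # placeholder
    "irc_port": 6667,
    "nick": "rc-bot",            # placeholder nick
    "udp_port": 9390,            # placeholder UDP listen port
}


def load_config(path="/etc/udpmxircecho-config.json"):
    config = dict(DEFAULTS)
    with open(path) as handle:
        config.update(json.load(handle))
    missing = [key for key in ("irc_server", "nick") if not config.get(key)]
    if missing:
        raise ValueError("missing required settings: %s" % ", ".join(missing))
    return config


if __name__ == "__main__":
    cfg = load_config()
    print("would connect to %s:%d as %s"
          % (cfg["irc_server"], cfg["irc_port"], cfg["nick"]))
```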
[18:10:29] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [18:10:30] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy [18:11:20] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=817.20 Read Requests/Sec=1770.50 Write Requests/Sec=0.70 KBytes Read/Sec=25697.20 KBytes_Written/Sec=19.60 [18:11:40] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [18:19:00] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=938.90 Read Requests/Sec=772.20 Write Requests/Sec=5.50 KBytes Read/Sec=3102.80 KBytes_Written/Sec=178.80 [18:19:21] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:25:47] (03PS1) 10Dzahn: mw_rc_irc: add missing auth.conf snippet [labs/private] - 10https://gerrit.wikimedia.org/r/287254 [18:27:51] (03CR) 10Dzahn: [C: 032 V: 032] "fixes running it in labs" [labs/private] - 10https://gerrit.wikimedia.org/r/287254 (owner: 10Dzahn) [18:28:26] (03CR) 10Dzahn: "checked in prod. private, that snippet is there at the end of auth.conf indeed. was just missing in labs/private then, thanks for testing " [puppet] - 10https://gerrit.wikimedia.org/r/286783 (https://phabricator.wikimedia.org/T134271) (owner: 10Dzahn) [18:31:02] (03CR) 10GWicke: [C: 031] Text VCL: RB ?redirect=false optimization (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/287104 (https://phabricator.wikimedia.org/T134464) (owner: 10BBlack) [18:31:31] (03CR) 10GWicke: "Thank you, @bblack! This looks good to go to me." [puppet] - 10https://gerrit.wikimedia.org/r/287104 (https://phabricator.wikimedia.org/T134464) (owner: 10BBlack) [18:32:41] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=249.70 Read Requests/Sec=288.60 Write Requests/Sec=5.60 KBytes Read/Sec=1164.00 KBytes_Written/Sec=466.40 [18:32:49] 06Operations, 10Parsoid, 10RESTBase, 06Services-next, and 4 others: Make RB ?redirect=false cache-efficient - https://phabricator.wikimedia.org/T134464#2270692 (10GWicke) 301s (returned for title normalization) are always safe to cache (and we do send the corresponding headers). Those redirects are applied... [19:12:31] 06Operations, 06Parsing-Team, 06Services, 03Mobile-Content-Service: ChangeProp / RESTBase / Parsoid outage 2016-05-05 - https://phabricator.wikimedia.org/T134537#2270759 (10GWicke) [19:14:16] 06Operations, 06Parsing-Team, 06Services, 03Mobile-Content-Service: ChangeProp / RESTBase / Parsoid outage 2016-05-05 - https://phabricator.wikimedia.org/T134537#2268779 (10GWicke) [19:25:07] (03CR) 10GWicke: Text VCL: RB ?redirect=false optimization (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/287104 (https://phabricator.wikimedia.org/T134464) (owner: 10BBlack) [19:33:58] any channel admin around? [19:36:01] bblack mutante andrewbogott YuviPanda Reedy robh ^^ - you guys are... [19:36:58] (a) handsome, (b) friendly, (c) talented, (d) all of the above [19:37:34] please lock in your votes now [19:38:47] ori, you (d) guy, you are also channel admin... [19:39:26] I don't think I am, but anyways, what's up? [19:39:53] you are +Aiortv [19:40:00] ori: can you mute wikibugs for a bit, pls? [19:40:26] do you need to mass-change a pile of tasks or something? [19:40:28] i'm about to perform some batch edit which will involve 110 tasks, so no need to spam here... [19:41:13] but wikibugs is on 6+ channels -- will only operations-related tasks be affected? [19:42:01] hmm, true... 
i guess dev will be most flooded [19:42:12] is there any wikibugs command to silent it? [19:42:20] we fixed that it does NOT get kicked that fast anymore when it floods.. y :) [19:42:31] but there is still the second level limit [19:42:58] stweard bot has @silent or sth like that command [19:42:58] so the "feature" that it died is gone. just let it flood i guess [19:43:23] mutante: if somebody will complain, i'll blame it on you then, ok? :-P [19:43:48] yeah. your conscientiousness with respect to IRC acoustics is noted (and appreciated), but just do it IMO. [19:44:00] ok, let me change that to " ask andre_ and greg-g how they do it when they mass edit" :) [19:44:28] or just mute wikibugs on dev - that will be the biggest victim, others will survive - not so many tasks in them [19:44:36] otoh , hard to beat icinga-wm [19:44:54] most ppl have it on ignore list [19:45:00] unlike wikibugs ;-) [19:46:35] Danny_B: go ahead [19:46:36] are you adding a lot of "tracking" ? [19:46:52] mutante: nope. epic [19:47:03] tracking done yesterday and day before [19:47:08] Danny_B: I killed wikibugs for now, please let me know when I can restart it [19:47:34] valhallasw`cloud: running, will let you know when done [19:47:35] thanks [19:47:40] valhallasw`cloud: \o/ thank you [19:47:55] 25% [19:48:15] 50% [19:48:34] 75% [19:49:08] valhallasw`cloud: done. thanks [19:49:23] valhallasw`cloud: have you coded that bot? [19:49:33] Danny_B: legoktm, mostly, but I'm one of the maintainers, yes [19:50:20] valhallasw`cloud: would it be hard to add the @mute / @speak commands to it? [19:50:55] it would be more practical than killing the bot ;-) [19:51:18] Danny_B: it would require fun like authentication [19:51:47] nope, just have the list of allowed nicks [19:52:03] stewardbot has it that way somehow iirc [19:52:30] yes, so some form of authentication ;-) [19:52:35] have list of allowed nicks, match them against uses = fun like authentication [19:53:35] maybe easier to use a web ui in labs with the ldap users [19:56:08] !log launching some test query traffic against labs DNS to test new settings [19:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:03:15] mutante: yup [20:03:26] good idea... go for it! ;-) [20:08:09] i'd rather puppetize eggdrop [20:08:32] you want the mute feature :) [20:09:49] PROBLEM - puppet last run on lvs4004 is CRITICAL: CRITICAL: puppet fail [20:10:49] * YuviPanda mumbles something about 'yes, this pig, how about we try some glitter lipstick' and then ignores all responses [20:30:07] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=839.50 Read Requests/Sec=684.20 Write Requests/Sec=4.20 KBytes Read/Sec=28661.20 KBytes_Written/Sec=484.80 [20:36:49] RECOVERY - puppet last run on lvs4004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:44:49] PROBLEM - DPKG on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:44:50] PROBLEM - salt-minion processes on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:45:00] PROBLEM - Disk space on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:45:30] [20:45:39] PROBLEM - RAID on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:45:49] PROBLEM - Labs LDAP on serpens is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:45:59] PROBLEM - configured eth on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
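
The exchange above sketches an @mute/@speak command for wikibugs, gated on a list of allowed nicks rather than full authentication. A rough illustration of that idea follows; it is not the real wikibugs code, the handler signature is hypothetical, and as the discussion notes, matching on nick alone is weak authentication (a real implementation would want NickServ/account checks or the suggested LDAP-backed web UI).

```python
#!/usr/bin/env python
"""Sketch of the @mute / @speak idea discussed above: a per-channel mute
flag that only a hard-coded set of nicks may toggle. Names are placeholders
and this is not how the wikibugs code base is actually structured."""

ALLOWED_NICKS = {"valhallasw", "legoktm"}   # placeholder allow-list
muted_channels = set()


def handle_command(nick, channel, message):
    """Return a reply string for @mute/@speak, or None for anything else.
    Note: trusting the nick alone is spoofable; a real bot should also
    verify the NickServ account behind it."""
    text = message.strip()
    if text == "@mute":
        if nick not in ALLOWED_NICKS:
            return "%s: you are not allowed to mute me" % nick
        muted_channels.add(channel)
        return "muted in %s until someone says @speak" % channel
    if text == "@speak":
        if nick not in ALLOWED_NICKS:
            return "%s: you are not allowed to unmute me" % nick
        muted_channels.discard(channel)
        return "back to reporting in %s" % channel
    return None


def should_report(channel):
    """Called before relaying a Phabricator event to a channel."""
    return channel not in muted_channels
```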
[20:46:09] PROBLEM - dhclient process on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:46:39] RECOVERY - salt-minion processes on serpens is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:46:39] RECOVERY - DPKG on serpens is OK: All packages OK [20:46:50] RECOVERY - Disk space on serpens is OK: DISK OK [20:47:20] RECOVERY - RAID on serpens is OK: OK: no RAID installed [20:47:30] RECOVERY - Labs LDAP on serpens is OK: LDAP OK - 0.113 seconds response time [20:47:40] RECOVERY - configured eth on serpens is OK: OK - interfaces up [20:47:59] RECOVERY - dhclient process on serpens is OK: PROCS OK: 0 processes with command name dhclient [20:54:06] (03PS1) 10Ottomata: Initial debian packaging [debs/druid] - 10https://gerrit.wikimedia.org/r/287285 [20:54:24] (03PS2) 10Ottomata: Initial debian packaging [debs/druid] - 10https://gerrit.wikimedia.org/r/287285 (https://phabricator.wikimedia.org/T134503) [20:55:33] (03PS3) 10Ottomata: Initial debian packaging [debs/druid] - 10https://gerrit.wikimedia.org/r/287285 (https://phabricator.wikimedia.org/T134503) [20:56:20] (03CR) 10Ottomata: "First pass ready for review. I'm sure this will need lots of tweaks as we figure this out a little more, but for the time being this work" [debs/druid] - 10https://gerrit.wikimedia.org/r/287285 (https://phabricator.wikimedia.org/T134503) (owner: 10Ottomata) [21:06:37] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5884017 keys - replication_delay is 0 [21:14:41] 07Puppet, 10Beta-Cluster-Infrastructure, 07Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2271125 (10mmodell) [21:50:43] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:50:54] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:52:52] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] [21:54:13] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [1000.0] [21:58:12] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy [22:03:23] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:10:12] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [22:10:52] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [22:17:33] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:17:52] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:19:23] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy [22:19:34] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [22:24:09] 07Puppet, 06Labs, 07Documentation: Missing documentation for labs puppet roles - https://phabricator.wikimedia.org/T91770#2271263 (10Danny_B) [22:26:23] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:26:24] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:26:24] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:26:32] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:26:33] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:26:33] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:26:33] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:26:43] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:26:43] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:26:43] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:27:12] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:27:12] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:27:13] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:27:13] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:28:13] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [22:28:13] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [22:28:13] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [22:28:22] RECOVERY - MariaDB Slave Lag: s1 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 87715.09 seconds [22:28:23] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [22:28:23] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [22:28:23] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave [22:28:33] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave [22:28:33] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [22:28:33] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional) [22:28:53] RECOVERY - MariaDB Slave Lag: s3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 86758.27 seconds [22:28:54] RECOVERY - MariaDB Slave Lag: m2 on dbstore1001 is OK: OK slave_sql_lag not a slave [22:28:54] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [22:29:03] RECOVERY - MariaDB Slave Lag: s6 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 84983.72 seconds [22:30:50] (03PS1) 10Alex Monk: Fix default IRC rc passwords [labs/private] - 10https://gerrit.wikimedia.org/r/287293 [22:30:52] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:31:13] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:31:59] (03CR) 10Alex Monk: "(I just put FAKEFAKEFAKE through mkpassword on deployment-ircd)" [labs/private] - 10https://gerrit.wikimedia.org/r/287293 (owner: 10Alex Monk) [22:34:12] (03PS1) 10Alex Monk: deployment-prep: Configure irc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287294 [22:35:13] (03CR) 10Alex Monk: [C: 032] deployment-prep: Configure irc 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/287294 (owner: 10Alex Monk) [22:35:38] (03Merged) 10jenkins-bot: deployment-prep: Configure irc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/287294 (owner: 10Alex Monk) [22:37:03] !log krenair@tin Synchronized wmf-config/InitialiseSettings-labs.php: labs-only change: https://gerrit.wikimedia.org/r/#/c/287294/ (duration: 00m 45s) [22:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:37:49] !log krenair@tin Synchronized wmf-config/LabsServices.php: labs-only change: https://gerrit.wikimedia.org/r/#/c/287294/ (duration: 00m 33s) [22:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:38:51] (03CR) 10Alex Monk: [C: 031] "Works in labs after I99834aeb (labs/private repo)" [puppet] - 10https://gerrit.wikimedia.org/r/286783 (https://phabricator.wikimedia.org/T134271) (owner: 10Dzahn) [22:38:59] 06Operations, 10DBA: Replicate the Phabricator database to labsdb - https://phabricator.wikimedia.org/T52422#2271291 (10scfc) [22:39:41] 06Operations, 10Dumps-Generation, 07HHVM, 13Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#2271293 (10ArielGlenn) Ori had problems reproducing this so I am putting the exact script and data file with step by step "how to reproduce" here, which wil... [22:39:54] ^ it looks like I can +2 on that labs private repo, though I'd prefer ops to do it [22:46:43] (03CR) 10Alex Monk: [C: 04-1] "I set up deployment-ircd using I3b9fe33d and tested it there, turns out this doesn't work" [puppet] - 10https://gerrit.wikimedia.org/r/287246 (owner: 10Alex Monk) [22:49:48] https://grafana-admin.wikimedia.org asks user pass [22:50:25] Is there a way to give me access? I want to setup some graphite dashboards [22:51:59] requires nda [22:52:10] we should probably add an extra group allowed to edit that without requiring nda [22:52:23] (03PS3) 10Alex Monk: udpmxircecho: use a config file [puppet] - 10https://gerrit.wikimedia.org/r/287246 [22:52:25] (03PS3) 10Alex Monk: udpmxircecho: Move from template to file [puppet] - 10https://gerrit.wikimedia.org/r/287247 [22:52:59] honestly why reading logs requires NDA [22:53:02] :D [22:53:07] but anyway [22:53:10] thanks [22:54:01] Amir1, we could do an NDA. I think that might be important for our move to prod anyway. [22:55:04] halfak: it would be great [22:55:49] OK email sent. [22:55:54] Will let you know [22:56:23] thanks [22:56:43] one thing. I signed an NDA for OTRS but i think it's different [22:58:18] honestly why reading logs requires NDA [22:58:28] the logs can contain private information [22:58:43] yeah [22:58:49] but in labs [22:58:50] Amir1, yes, OTRS/on-wiki rights have separate NDAs to the technical systems [22:59:08] what logs in labs do you want access to? [22:59:11] AFAIK the only thing is user agent [22:59:32] I don't want to access logs right in labs [22:59:40] I just login to the instance [22:59:59] but I need to setup graphite dashboard for some instances in labs [23:03:00] which instance? 
[23:03:14] I'm confused about what the problem is [23:04:05] wikilabels-01.wikilabels.eqiad.wmflabs [23:04:08] Krenair: ^ [23:06:20] (03CR) 10Alex Monk: [C: 04-1] udpmxircecho: Move from template to file [puppet] - 10https://gerrit.wikimedia.org/r/287247 (owner: 10Alex Monk) [23:22:39] (03CR) 10Dzahn: [C: 032] Fix default IRC rc passwords [labs/private] - 10https://gerrit.wikimedia.org/r/287293 (owner: 10Alex Monk) [23:22:49] (03CR) 10Dzahn: [V: 032] Fix default IRC rc passwords [labs/private] - 10https://gerrit.wikimedia.org/r/287293 (owner: 10Alex Monk) [23:24:24] mutante: Re mass-batch-editing Phab tasks: Well, I just ignore wikibugs. I coldn't care less if it died or not. :P [23:25:20] andre__: :) yep [23:27:02] I don't get people on IRC saying "Couldn't you mute that bot before mass-editing?!" or "Andre, tell me why that user subscribes to all Phab tasks". Over those years I came to the realization that I can neither read the mind of bots nor other people. [23:27:33] ...and now back to random paperback ("Annual Reviews"). [23:28:10] andre__: i also thought there are just 33 minutes left but we got an extension, [23:33:40] (03PS4) 10Alex Monk: udpmxircecho: Move from template to file [puppet] - 10https://gerrit.wikimedia.org/r/287247 [23:35:37] 06Operations, 10media-storage, 13Patch-For-Review: Unable to delete, restore/undelete, move or upload new versions of files on several wikis ("inconsistent state within the internal storage backends") - https://phabricator.wikimedia.org/T128096#2271362 (10aaron) 05Open>03Resolved +channel:FileOperation i... [23:46:18] (03CR) 10Dzahn: [C: 031] Diamond collector for nagios plugin return codes [puppet] - 10https://gerrit.wikimedia.org/r/287121 (https://phabricator.wikimedia.org/T111064) (owner: 10Alex Monk) [23:47:07] (03CR) 10Dzahn: [C: 032] "noop because nothing uses it yet" [puppet] - 10https://gerrit.wikimedia.org/r/287121 (https://phabricator.wikimedia.org/T111064) (owner: 10Alex Monk) [23:47:29] mutante: WHAT? Extension? I should check email I guess. Then I could drink booze now instead of quickly typing random stuff... [23:48:04] andre__: lol, yes, same here. until Sunday [23:48:49] mutante, Whoot. Thanks for telling me to check mail. Well, I guess this bottle of Whiskey has suddenly higher priority than the last form to fill out. Yay! [23:49:07] haha, there is whiskey talk on multiple channels [23:49:21] yw [23:54:29] PROBLEM - High lag on wdqs1002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0] [23:58:23] 06Operations, 06WMF-Legal, 07Privacy: Consider moving policy.wikimedia.org away from WordPress.com - https://phabricator.wikimedia.org/T132104#2271413 (10ashley) Porting over WordPress skins is pretty easy (though you can almost always ditch the PHP code, since MediaWiki is rather different in that respect)....
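
Change r/287121, merged above as a no-op because nothing uses it yet, adds a Diamond collector for nagios plugin return codes. Below is a minimal sketch of what such a collector could look like using Diamond's standard Collector API; the plugin path, arguments, and metric name are placeholders, and the real collector is very likely organised differently.

```python
"""Hypothetical Diamond collector that runs a nagios plugin and publishes
its exit code (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN) as a metric."""
import os
import subprocess

import diamond.collector


class NagiosReturnCodeCollector(diamond.collector.Collector):

    def get_default_config(self):
        config = super(NagiosReturnCodeCollector, self).get_default_config()
        config.update({
            # Placeholder plugin and metric name, for illustration only.
            'plugin_path': '/usr/lib/nagios/plugins/check_disk',
            'plugin_args': '-w 10% -c 5%',
            'metric_name': 'check_disk.returncode',
        })
        return config

    def collect(self):
        command = ([self.config['plugin_path']]
                   + self.config['plugin_args'].split())
        try:
            with open(os.devnull, 'w') as devnull:
                code = subprocess.call(command, stdout=devnull, stderr=devnull)
        except OSError:
            code = 3  # treat a missing or unrunnable plugin as UNKNOWN
        self.publish(self.config['metric_name'], code)
```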