[02:19:27] (PS1) Reedy: phpcs changes [mediawiki-config] - https://gerrit.wikimedia.org/r/375709
[02:19:50] (PS2) Reedy: phpcs changes [mediawiki-config] - https://gerrit.wikimedia.org/r/375709
[02:21:32] (PS9) Reedy: Update mediawiki-codesniffer to 0.11.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/367465
[02:23:47] (CR) jerkins-bot: [V: -1] Update mediawiki-codesniffer to 0.11.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/367465 (owner: Reedy)
[02:37:23] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.16) (duration: 08m 17s)
[02:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:41:11] (PS3) Reedy: Generate FancyCaptchas in 4 threads [puppet] - https://gerrit.wikimedia.org/r/358395 (https://phabricator.wikimedia.org/T157736)
[02:41:34] (CR) jerkins-bot: [V: -1] Generate FancyCaptchas in 4 threads [puppet] - https://gerrit.wikimedia.org/r/358395 (https://phabricator.wikimedia.org/T157736) (owner: Reedy)
[02:42:31] (PS4) Reedy: Generate FancyCaptchas in 4 threads [puppet] - https://gerrit.wikimedia.org/r/358395 (https://phabricator.wikimedia.org/T157736)
[02:44:15] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Sep 4 02:44:15 UTC 2017 (duration 6m 52s)
[02:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:47:32] (PS6) Reedy: Optionally filter private wiki results in mwgrep [puppet] - https://gerrit.wikimedia.org/r/262068 (https://phabricator.wikimedia.org/T71581)
[02:48:43] (PS7) Reedy: Optionally filter private wiki results in mwgrep [puppet] - https://gerrit.wikimedia.org/r/262068 (https://phabricator.wikimedia.org/T71581)
[02:49:16] (CR) Reedy: "Any mind if I add this to a puppet swat?" [puppet] - https://gerrit.wikimedia.org/r/262068 (https://phabricator.wikimedia.org/T71581) (owner: Reedy)
[03:26:52] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 793.26 seconds
[04:15:31] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 147.23 seconds
[04:52:00] !log Deploy alter table on db1052 - T168661
[04:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:52:15] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[05:22:52] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[05:24:31] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[05:30:01] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:31:31] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:11:31] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0
[06:18:11] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0
[06:23:51] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0
[06:24:21] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[06:41:25] (PS1) Muehlenhoff: Also remove email address [puppet] - https://gerrit.wikimedia.org/r/375725
[06:42:41] (CR) Muehlenhoff: [C: 2] Also remove email address [puppet] - https://gerrit.wikimedia.org/r/375725 (owner: Muehlenhoff)
[06:48:09] (PS2) Muehlenhoff: Remove access for dworley [puppet] - https://gerrit.wikimedia.org/r/374979
[06:53:06] (CR) Muehlenhoff: [C: 2] Remove access for dworley [puppet] - https://gerrit.wikimedia.org/r/374979 (owner: Muehlenhoff)
[07:02:34] <_joe_> !log starting additional runJobs instance for htmlcacheupdate on commons T173710
[07:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:49] T173710: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710
[07:03:18] !log Deploy alter table on s3 codfw master with replication on kpbwiki - T168661
[07:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:29] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[07:11:42] (PS1) Muehlenhoff: Record extended MOU dates for flemmerich and nettrom [puppet] - https://gerrit.wikimedia.org/r/375728
[07:12:48] (CR) Muehlenhoff: [C: 2] Record extended MOU dates for flemmerich and nettrom [puppet] - https://gerrit.wikimedia.org/r/375728 (owner: Muehlenhoff)
[07:15:08] (PS1) Muehlenhoff: Remove access for ironholds [puppet] - https://gerrit.wikimedia.org/r/375731
[07:16:51] (CR) Muehlenhoff: [C: 2] Remove access for ironholds [puppet] - https://gerrit.wikimedia.org/r/375731 (owner: Muehlenhoff)
[07:25:33] Operations, Cloud-Services, Cloud-VPS, Patch-For-Review, Wikimedia-Incident: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#3576434 (hashar) @andrew is there still a lot of leaks happening? Current Nodepool id is 806287
[07:35:48] ah nice, first log that I see is marostegui doing alter tables, my week can start :D
[07:36:08] XDDDDDD
[07:36:18] and pretty early!
[07:39:06] !log installing gnupg security updates
[07:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:52] (PS1) Ladsgroup: mediawiki: make the wikidata wb_terms rebuild a little bit faster [puppet] - https://gerrit.wikimedia.org/r/375741 (https://phabricator.wikimedia.org/T171460)
[07:43:19] Operations, Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3576445 (ema) p:Triage>Normal
[07:46:26] Operations, Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3576064 (ema) Thanks @elukey! Yeah cp4024 might be having hardware issues. The system was down yesterday at 9ish AM UTC. I've power-cycled it and it came back online fine, but then after some hours it started with the l...
[07:47:17] (CR) Marostegui: [C: 1] mediawiki: make the wikidata wb_terms rebuild a little bit faster [puppet] - https://gerrit.wikimedia.org/r/375741 (https://phabricator.wikimedia.org/T171460) (owner: Ladsgroup)
[08:03:33] (PS10) Elukey: eventlogging_cleaner.py: improve sanitize method [puppet] - https://gerrit.wikimedia.org/r/374823 (https://phabricator.wikimedia.org/T156933) (owner: Mforns)
[08:03:36] Operations, HHVM: HHVM: Unknown exception - https://phabricator.wikimedia.org/T173705#3576459 (Legoktm)
[08:03:37] (PS11) Elukey: eventlogging_cleaner.py: improve sanitize method [puppet] - https://gerrit.wikimedia.org/r/374823 (https://phabricator.wikimedia.org/T156933) (owner: Mforns)
[08:14:09] (PS1) Giuseppe Lavagetto: Add entries for the jobrunner LVS service [dns] - https://gerrit.wikimedia.org/r/375747 (https://phabricator.wikimedia.org/T174599)
[08:28:11] Operations, ops-eqiad, DBA, hardware-requests: Decommission db1045 - https://phabricator.wikimedia.org/T174806#3576491 (Marostegui)
[08:29:20] (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db1045 [mediawiki-config] - https://gerrit.wikimedia.org/r/375748 (https://phabricator.wikimedia.org/T174806)
[08:33:34] (PS1) Marostegui: mariadb: Remove db1045 - will be decommissioned [puppet] - https://gerrit.wikimedia.org/r/375750 (https://phabricator.wikimedia.org/T174806)
[08:33:39] !log upload scap 3.7.0-1 to apt.w.o - T127762
[08:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:52] T127762: Update Debian Package for Scap3 - https://phabricator.wikimedia.org/T127762
[08:33:58] (PS2) Filippo Giunchedi: scap: upgrade to 3.7.0-1 [puppet] - https://gerrit.wikimedia.org/r/375029 (https://phabricator.wikimedia.org/T127762) (owner: Thcipriani)
[08:35:29] (CR) Filippo Giunchedi: [C: 2] scap: upgrade to 3.7.0-1 [puppet] - https://gerrit.wikimedia.org/r/375029 (https://phabricator.wikimedia.org/T127762) (owner: Thcipriani)
[08:36:23] (CR) Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/7690/" [puppet] - https://gerrit.wikimedia.org/r/375750 (https://phabricator.wikimedia.org/T174806) (owner: Marostegui)
[08:36:53] (CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Remove db1045 [mediawiki-config] - https://gerrit.wikimedia.org/r/375748 (https://phabricator.wikimedia.org/T174806) (owner: Marostegui)
[08:38:32] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db1045 [mediawiki-config] - https://gerrit.wikimedia.org/r/375748 (https://phabricator.wikimedia.org/T174806) (owner: Marostegui)
[08:38:42] (CR) jenkins-bot: db-eqiad,db-codfw.php: Remove db1045 [mediawiki-config] - https://gerrit.wikimedia.org/r/375748 (https://phabricator.wikimedia.org/T174806) (owner: Marostegui)
[08:39:45] !log installing libgcrypt20 security updates
[08:39:45] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Remove db1045 as it will be decommissioned - T174806 (duration: 00m 48s)
[08:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:58] T174806: Decommission db1045 - https://phabricator.wikimedia.org/T174806
[08:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:40] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove db1045 as it will be decommissioned - T174806 (duration: 00m 46s)
[08:40:47] (PS3) ArielGlenn: Removed dead FTP link from mirror list. [puppet] - https://gerrit.wikimedia.org/r/375614 (owner: Felipe L. Ewald)
[08:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:15] (PS2) Marostegui: mariadb: Remove db1045 - will be decommissioned [puppet] - https://gerrit.wikimedia.org/r/375750 (https://phabricator.wikimedia.org/T174806)
[08:42:01] (CR) Marostegui: [C: 2] mariadb: Remove db1045 - will be decommissioned [puppet] - https://gerrit.wikimedia.org/r/375750 (https://phabricator.wikimedia.org/T174806) (owner: Marostegui)
[08:42:06] (CR) ArielGlenn: [C: 2] Removed dead FTP link from mirror list. [puppet] - https://gerrit.wikimedia.org/r/375614 (owner: Felipe L. Ewald)
[08:42:37] (PS4) ArielGlenn: Removed dead FTP link from mirror list. [puppet] - https://gerrit.wikimedia.org/r/375614 (owner: Felipe L. Ewald)
[08:45:16] Operations, HHVM: HHVM: Unknown exception - https://phabricator.wikimedia.org/T173705#3537173 (Legoktm) 2.0.14 has been tagged.
[08:46:33] (PS2) Filippo Giunchedi: Fixup (obviously) typo'd data_file_directories entries [puppet] - https://gerrit.wikimedia.org/r/375400 (https://phabricator.wikimedia.org/T169939) (owner: Eevans)
[08:46:58] (PS1) Marostegui: s5.hosts: Remove db1045 - will be decommissioned [software] - https://gerrit.wikimedia.org/r/375752
[08:47:20] !log installing mercurial security updates
[08:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:36] (CR) Filippo Giunchedi: [C: 2] Fixup (obviously) typo'd data_file_directories entries [puppet] - https://gerrit.wikimedia.org/r/375400 (https://phabricator.wikimedia.org/T169939) (owner: Eevans)
[08:47:55] Operations, ops-eqiad, DBA, hardware-requests, Patch-For-Review: Decommission db1045 - https://phabricator.wikimedia.org/T174806#3576541 (Marostegui)
[08:48:13] (CR) Marostegui: [C: 2] s5.hosts: Remove db1045 - will be decommissioned [software] - https://gerrit.wikimedia.org/r/375752 (owner: Marostegui)
[08:48:41] !log Stop MySQL on db1045 as it will be decommissioned - T174806
[08:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:52] T174806: Decommission db1045 - https://phabricator.wikimedia.org/T174806
[08:48:54] (PS1) Filippo Giunchedi: typos: add cassanrda [puppet] - https://gerrit.wikimedia.org/r/375753
[08:48:57] (Merged) jenkins-bot: s5.hosts: Remove db1045 - will be decommissioned [software] - https://gerrit.wikimedia.org/r/375752 (owner: Marostegui)
[08:49:05] (PS2) Filippo Giunchedi: typos: add cassanrda [puppet] - https://gerrit.wikimedia.org/r/375753
[08:49:38] (CR) jerkins-bot: [V: -1] typos: add cassanrda [puppet] - https://gerrit.wikimedia.org/r/375753 (owner: Filippo Giunchedi)
[08:50:28] (CR) Elukey: "After a chat with the wise Volans we agreed that a follow up CR will be needed before the script will become a cron to:" [puppet] - https://gerrit.wikimedia.org/r/374823 (https://phabricator.wikimedia.org/T156933) (owner: Mforns)
[08:50:50] Operations, ops-eqiad, DBA, hardware-requests, Patch-For-Review: Decommission db1045 - https://phabricator.wikimedia.org/T174806#3576546 (Marostegui) a:Cmjohnson This server is ready to be totally decommissioned and only pending the DC Ops steps, so assigning it to @Cmjohnson for the pend...
[08:51:27] (CR) Volans: [C: 1] "Great, LGTM!" [puppet] - https://gerrit.wikimedia.org/r/374823 (https://phabricator.wikimedia.org/T156933) (owner: Mforns)
[08:51:43] (PS3) Filippo Giunchedi: typos: add cassanrda [puppet] - https://gerrit.wikimedia.org/r/375753
[08:52:19] (CR) jerkins-bot: [V: -1] typos: add cassanrda [puppet] - https://gerrit.wikimedia.org/r/375753 (owner: Filippo Giunchedi)
[08:52:34] godog: circular dependency :-P
[08:52:46] !log restart zookeeper on conf2003 for jvm security updates
[08:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:09] <_joe_> hieradata/role/common/restbase/production_ng.yaml :P
[08:53:10] you cannot add a typo and put it into the commit message, lol :D
[08:53:27] <_joe_> volans: that's not it
[08:53:35] * volans sarcarsm...
[08:53:36] <_joe_> volans: the typo is actually present
[08:53:54] I merged the patch to fix the typo earlier though
[08:54:13] <_joe_> is this patch rebased on top of it?
[08:54:22] <_joe_> if so, the problem might be with the rakefile
[08:54:34] <_joe_> lemme finish what I'm doing and I'll take a look
[08:54:42] btw it will be nice to report where it found the typo...
[08:54:48] Operations, ops-eqiad, DBA: Decommission db1037 - https://phabricator.wikimedia.org/T174902#3576559 (Marostegui)
[08:54:57] <_joe_> yeah I just copied that over from the old rakefile, meh
[08:55:08] _joe_: it is yeah, the only instance of the typo is in typos itself
[08:55:09] Operations, ops-eqiad, DBA: Decommission db1037 - https://phabricator.wikimedia.org/T174902#3576571 (Marostegui) p:Triage>Normal
[08:55:37] <_joe_> I'm in the middle of a patch, I'll take a look in a minute
[08:55:52] (PS12) Elukey: eventlogging_cleaner.py: improve sanitize method [puppet] - https://gerrit.wikimedia.org/r/374823 (https://phabricator.wikimedia.org/T156933) (owner: Mforns)
[08:57:46] Operations, Performance-Team, Thumbor, Patch-For-Review, User-fgiunchedi: Long running thumbnail requests locking up Thumbor instances - https://phabricator.wikimedia.org/T172930#3576576 (Gilles) Open>Resolved
[08:59:22] anyways not that urgent, I'll keep going with other reviews anyways
[09:00:37] Operations, Performance-Team, Thumbor, Patch-For-Review, User-fgiunchedi: Track incoming HTTP request count on the Thumbor boxes - https://phabricator.wikimedia.org/T151554#3576591 (Gilles) Open>Resolved Very nice!
[09:00:43] (PS11) Hashar: Jenkins integration of rspec [puppet] - https://gerrit.wikimedia.org/r/331856 (https://phabricator.wikimedia.org/T78342)
[09:00:59] (CR) Elukey: [C: 2] eventlogging_cleaner.py: improve sanitize method [puppet] - https://gerrit.wikimedia.org/r/374823 (https://phabricator.wikimedia.org/T156933) (owner: Mforns)
[09:02:19] (PS1) Volans: Icinga: raid handler, catch zlib exceptions [puppet] - https://gerrit.wikimedia.org/r/375755 (https://phabricator.wikimedia.org/T174854)
[09:02:21] (PS1) Volans: Raid: optimize get raid status for HP controllers [puppet] - https://gerrit.wikimedia.org/r/375756 (https://phabricator.wikimedia.org/T174854)
[09:02:38] <_joe_> godog: so interestingly
[09:04:14] <_joe_> this is a bug in the rakefile indeed
[09:04:20] <_joe_> I'm fixing it
[09:04:35] <_joe_> I didn't consider the case in which only the typos file was modified
[09:06:14] _joe_: ack, thanks for taking a look!
[09:06:21] (PS4) Filippo Giunchedi: Allow more per-instance overrides [puppet] - https://gerrit.wikimedia.org/r/375414 (https://phabricator.wikimedia.org/T169939) (owner: Eevans)
[09:08:04] (CR) Filippo Giunchedi: [C: 2] Allow more per-instance overrides [puppet] - https://gerrit.wikimedia.org/r/375414 (https://phabricator.wikimedia.org/T169939) (owner: Eevans)
[09:08:26] (PS4) Giuseppe Lavagetto: typos: add cassanrda [puppet] - https://gerrit.wikimedia.org/r/375753 (owner: Filippo Giunchedi)
[09:08:28] (PS1) Giuseppe Lavagetto: Rakefile: properly exclude typos file from grepping [puppet] - https://gerrit.wikimedia.org/r/375757
[09:08:49] <_joe_> godog: ^^
[09:09:32] <_joe_> merge both at your earliest convenience :)
[09:09:33] (CR) Filippo Giunchedi: [C: 1] Rakefile: properly exclude typos file from grepping [puppet] - https://gerrit.wikimedia.org/r/375757 (owner: Giuseppe Lavagetto)
[09:09:37] ack
[09:09:42] (CR) Filippo Giunchedi: [C: 2] Rakefile: properly exclude typos file from grepping [puppet] - https://gerrit.wikimedia.org/r/375757 (owner: Giuseppe Lavagetto)
[09:10:52] (CR) Filippo Giunchedi: [C: 2] typos: add cassanrda [puppet] - https://gerrit.wikimedia.org/r/375753 (owner: Filippo Giunchedi)
[09:13:26] (PS1) Giuseppe Lavagetto: Rakefile: print offending files when searching typos [puppet] - https://gerrit.wikimedia.org/r/375759
[09:13:40] <_joe_> volans: ^^ you've been served, sir
[09:14:01] PROBLEM - HHVM rendering on mw1296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:14:52] RECOVERY - HHVM rendering on mw1296 is OK: HTTP OK: HTTP/1.1 200 OK - 80974 bytes in 0.169 second response time
[09:18:50] _joe_: thanks! :D
[09:20:58] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[09:21:00] (PS9) Hashar: systemd: allow isequal to match programname in/rsyslog [puppet] - https://gerrit.wikimedia.org/r/337411
[09:21:16] (CR) Filippo Giunchedi: [C: 2] Configure restbase2001 instance data paths [puppet] - https://gerrit.wikimedia.org/r/375415 (https://phabricator.wikimedia.org/T169939) (owner: Eevans)
[09:21:56] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge.
[09:22:14] (CR) jerkins-bot: [V: -1] systemd: allow isequal to match programname in/rsyslog [puppet] - https://gerrit.wikimedia.org/r/337411 (owner: Hashar)
[09:22:59] (PS4) Filippo Giunchedi: Configure restbase2001 instance data paths [puppet] - https://gerrit.wikimedia.org/r/375415 (https://phabricator.wikimedia.org/T169939) (owner: Eevans)
[09:24:55] (PS10) Hashar: systemd: allow isequal to match programname in/rsyslog [puppet] - https://gerrit.wikimedia.org/r/337411
[09:36:39] !log uploaded debdeploy 0.0.99-1/stretch-wikimedia to apt.wikimedia.org
[09:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:55] RECOVERY - Check systemd state on chlorine is OK: OK - running: The system is fully operational
[09:40:55] PROBLEM - Check systemd state on chlorine is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:41:24] !log mobrovac@tin Started restart [mathoid/deploy@44ea6d8]: Restart for libxml update
[09:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:35] moritzm: ^^^
[09:42:07] (PS12) Hashar: Jenkins integration of rspec [puppet] - https://gerrit.wikimedia.org/r/331856 (https://phabricator.wikimedia.org/T78342)
[09:42:51] mobrovac: thanjs
[09:43:05] (PS11) Hashar: systemd: allow isequal to match programname in/rsyslog [puppet] - https://gerrit.wikimedia.org/r/337411
[09:49:30] (PS13) Hashar: Jenkins integration of rspec [puppet] - https://gerrit.wikimedia.org/r/331856 (https://phabricator.wikimedia.org/T78342)
[09:51:42] !log uploaded debdeploy 0.0.99-1+jessie/jessie-wikimedia to apt.wikimedia.org
[09:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:33] (PS1) Joal: Update pivot config with explicit dimensions [puppet] - https://gerrit.wikimedia.org/r/375762 (https://phabricator.wikimedia.org/T161824)
[10:01:54] (CR) jerkins-bot: [V: -1] Update pivot config with explicit dimensions [puppet] - https://gerrit.wikimedia.org/r/375762 (https://phabricator.wikimedia.org/T161824) (owner: Joal)
[10:02:55] RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 0.038 second response time on 10.192.16.162 port 9042
[10:03:25] elukey: Can we manually test that one then (explicit cong) --^
[10:04:32] joal: jenkins complains about the commit msg - https://integration.wikimedia.org/ci/job/operations-puppet-tests-docker/4293/console
[10:05:23] elukey: Bug should only have one value
[10:05:26] elukey: patching
[10:06:04] (PS2) Joal: Update pivot config with explicit dimensions [puppet] - https://gerrit.wikimedia.org/r/375762 (https://phabricator.wikimedia.org/T161824)
[10:07:48] elukey: --^
[10:08:55] RECOVERY - cassandra-b CQL 10.192.16.163:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.163 port 9042
[10:15:05] (PS1) ArielGlenn: Fix variant abstract dumps generation [dumps] - https://gerrit.wikimedia.org/r/375763 (https://phabricator.wikimedia.org/T174906)
[10:15:16] RECOVERY - cassandra-c CQL 10.192.16.164:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.164 port 9042
[10:21:55] (CR) ArielGlenn: [C: 2] Fix variant abstract dumps generation [dumps] - https://gerrit.wikimedia.org/r/375763 (https://phabricator.wikimedia.org/T174906) (owner: ArielGlenn)
[10:22:47] !log ariel@tin Started deploy [dumps/dumps@0b32eef]: fix abstract dumps variants regression
[10:22:50] !log ariel@tin Finished deploy [dumps/dumps@0b32eef]: fix abstract dumps variants regression (duration: 00m 02s)
[10:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:18] (PS3) Joal: Update pivot config with explicit dimensions [puppet] - https://gerrit.wikimedia.org/r/375762 (https://phabricator.wikimedia.org/T168550)
[10:26:45] (PS4) Elukey: Update pivot config with explicit dimensions [puppet] - https://gerrit.wikimedia.org/r/375762 (https://phabricator.wikimedia.org/T168550) (owner: Joal)
[10:27:15] (CR) Elukey: [C: 2] Update pivot config with explicit dimensions [puppet] - https://gerrit.wikimedia.org/r/375762 (https://phabricator.wikimedia.org/T168550) (owner: Joal)
[10:34:22] (PS1) ArielGlenn: move dumps other than the xml/sql dumps to new path on dumpsdata hosts [puppet] - https://gerrit.wikimedia.org/r/375768 (https://phabricator.wikimedia.org/T169849)
[10:34:47] (CR) jerkins-bot: [V: -1] move dumps other than the xml/sql dumps to new path on dumpsdata hosts [puppet] - https://gerrit.wikimedia.org/r/375768 (https://phabricator.wikimedia.org/T169849) (owner: ArielGlenn)
[10:37:47] (PS2) ArielGlenn: move dumps other than the xml/sql dumps to new path on dumpsdata hosts [puppet] - https://gerrit.wikimedia.org/r/375768 (https://phabricator.wikimedia.org/T169849)
[10:44:27] (CR) ArielGlenn: [C: 2] move dumps other than the xml/sql dumps to new path on dumpsdata hosts [puppet] - https://gerrit.wikimedia.org/r/375768 (https://phabricator.wikimedia.org/T169849) (owner: ArielGlenn)
[10:57:17] Operations, Cassandra, Epic, Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3576917 (mobrovac)
[10:57:20] Operations, ops-eqiad, Services (doing): Disk errors: restbase1010.eqiad.wmnet - https://phabricator.wikimedia.org/T174392#3576916 (mobrovac)
[10:57:43] Operations, ops-eqiad, Services (watching): Disk errors: restbase1010.eqiad.wmnet - https://phabricator.wikimedia.org/T174392#3560205 (mobrovac) a:Eevans>None
[11:02:26] (PS1) Volans: Transports: better handling of empty list [software/cumin] - https://gerrit.wikimedia.org/r/375769 (https://phabricator.wikimedia.org/T174911)
[11:03:09] (Abandoned) Matthias Mullie: Add missing THREED2PNG_PATH [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/373595 (https://phabricator.wikimedia.org/T161719) (owner: Matthias Mullie)
[11:04:51] Operations, Electron-PDFs, Readers-Web-Backlog, Services (blocked): electron/pdfrender hangs - https://phabricator.wikimedia.org/T174916#3576939 (mobrovac)
[11:06:03] Operations, Electron-PDFs, Patch-For-Review, Readers-Web-Backlog (Tracking), Services (blocked): pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3083419 (mobrovac) Open>Resolved a:mobrovac Agreed, this task has beco...
[11:06:19] Operations, Electron-PDFs, Readers-Web-Backlog (Tracking), Services (done): pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3576961 (mobrovac)
[11:07:36] RECOVERY - Check systemd state on chlorine is OK: OK - running: The system is fully operational
[11:09:39] (CR) Hoo man: "Deployment of this needs to be thoroughly monitored. I'll find a time for that." [mediawiki-config] - https://gerrit.wikimedia.org/r/375544 (https://phabricator.wikimedia.org/T151717) (owner: Eranroz)
[11:10:45] PROBLEM - Check systemd state on chlorine is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:11:36] (PS1) Muehlenhoff: Show the amount of updated/non-updated hosts [debs/debdeploy] - https://gerrit.wikimedia.org/r/375770
[11:11:38] (PS1) Muehlenhoff: Show an error if passing an invalid Cumin alias [debs/debdeploy] - https://gerrit.wikimedia.org/r/375771
[11:12:33] (CR) Volans: [C: 1] "LGTM" [debs/debdeploy] - https://gerrit.wikimedia.org/r/375771 (owner: Muehlenhoff)
[11:14:05] (PS1) Gilles: Only uses lua in Thumbor Nginx config if "extras" variant [puppet] - https://gerrit.wikimedia.org/r/375772 (https://phabricator.wikimedia.org/T174746)
[11:15:23] (CR) Volans: [C: 1] "Looks good, see one comment inline." (1 comment) [debs/debdeploy] - https://gerrit.wikimedia.org/r/375770 (owner: Muehlenhoff)
[11:18:17] (PS8) ArielGlenn: Add RDF dumps for categories [puppet] - https://gerrit.wikimedia.org/r/373354 (https://phabricator.wikimedia.org/T173892) (owner: Smalyshev)
[11:21:39] (PS1) Filippo Giunchedi: cassandra: reprovision restbase2003 with cassandra 3 [puppet] - https://gerrit.wikimedia.org/r/375774 (https://phabricator.wikimedia.org/T169939)
[11:29:08] (PS1) Filippo Giunchedi: cassandra: add execute permission to jbod mountpoint [puppet] - https://gerrit.wikimedia.org/r/375775 (https://phabricator.wikimedia.org/T169939)
[11:33:43] (PS9) ArielGlenn: Add RDF dumps for categories [puppet] - https://gerrit.wikimedia.org/r/373354 (https://phabricator.wikimedia.org/T173892) (owner: Smalyshev)
[11:34:07] (CR) jerkins-bot: [V: -1] Add RDF dumps for categories [puppet] - https://gerrit.wikimedia.org/r/373354 (https://phabricator.wikimedia.org/T173892) (owner: Smalyshev)
[11:36:05] (PS10) ArielGlenn: Add RDF dumps for categories [puppet] - https://gerrit.wikimedia.org/r/373354 (https://phabricator.wikimedia.org/T173892) (owner: Smalyshev)
[11:40:26] Operations, ops-eqiad, DBA: Decommission db1037 - https://phabricator.wikimedia.org/T174902#3577022 (Marostegui)
[11:40:28] (PS2) Filippo Giunchedi: cassandra: add execute permission to jbod mountpoint [puppet] - https://gerrit.wikimedia.org/r/375775 (https://phabricator.wikimedia.org/T169939)
[11:40:30] (PS2) Filippo Giunchedi: cassandra: reprovision restbase2003 with cassandra 3 [puppet] - https://gerrit.wikimedia.org/r/375774 (https://phabricator.wikimedia.org/T169939)
[11:41:25] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2001535
[11:42:10] (PS1) Marostegui: db-eqiad,db-codfw.php: Remove db1037 [mediawiki-config] - https://gerrit.wikimedia.org/r/375777 (https://phabricator.wikimedia.org/T174902)
[11:44:18] (PS11) ArielGlenn: Add RDF dumps for categories [puppet] - https://gerrit.wikimedia.org/r/373354 (https://phabricator.wikimedia.org/T173892) (owner: Smalyshev)
[11:45:02] !log restart varnish on cp1099 - mailbox lag
[11:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:06] (CR) ArielGlenn: [C: 2] Add RDF dumps for categories [puppet] - https://gerrit.wikimedia.org/r/373354 (https://phabricator.wikimedia.org/T173892) (owner: Smalyshev)
[11:47:23] (PS1) Marostegui: mariadb: Remove db1037 for decomm [puppet] - https://gerrit.wikimedia.org/r/375779 (https://phabricator.wikimedia.org/T174902)
[11:49:27] (CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Remove db1037 [mediawiki-config] - https://gerrit.wikimedia.org/r/375777 (https://phabricator.wikimedia.org/T174902) (owner: Marostegui)
[11:49:44] (CR) Ema: varnish: introduce rate limiting for maps (1 comment) [puppet] - https://gerrit.wikimedia.org/r/375354 (https://phabricator.wikimedia.org/T169175) (owner: Ema)
[11:50:50] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Remove db1037 [mediawiki-config] - https://gerrit.wikimedia.org/r/375777 (https://phabricator.wikimedia.org/T174902) (owner: Marostegui)
[11:50:59] (CR) jenkins-bot: db-eqiad,db-codfw.php: Remove db1037 [mediawiki-config] - https://gerrit.wikimedia.org/r/375777 (https://phabricator.wikimedia.org/T174902) (owner: Marostegui)
[11:51:25] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 0
[11:52:16] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Remove db1037 as it will be decommissioned - T174902 (duration: 00m 46s)
[11:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:28] T174902: Decommission db1037 - https://phabricator.wikimedia.org/T174902
[11:53:17] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove db1037 as it will be decommissioned - T174902 (duration: 00m 46s)
[11:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:55:09] (CR) Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/7691/" [puppet] - https://gerrit.wikimedia.org/r/375779 (https://phabricator.wikimedia.org/T174902) (owner: Marostegui)
[11:56:29] (PS1) Gilles: Add STL support to Thumbor, behind flag [puppet] - https://gerrit.wikimedia.org/r/375781 (https://phabricator.wikimedia.org/T161719)
[11:56:32] (PS2) Marostegui: mariadb: Remove db1037 for decomm [puppet] - https://gerrit.wikimedia.org/r/375779 (https://phabricator.wikimedia.org/T174902)
[12:01:01] (PS1) ArielGlenn: use cleaner method of getting config settings for rdf cat dumps cron job [puppet] - https://gerrit.wikimedia.org/r/375783
[12:01:52] (PS4) Ema: varnish: introduce rate limiting for maps [puppet] - https://gerrit.wikimedia.org/r/375354 (https://phabricator.wikimedia.org/T169175)
[12:01:57] (CR) Ema: [V: 2 C: 2] varnish: introduce rate limiting for maps [puppet] - https://gerrit.wikimedia.org/r/375354 (https://phabricator.wikimedia.org/T169175) (owner: Ema)
[12:07:31] (CR) Muehlenhoff: Show the amount of updated/non-updated hosts (1 comment) [debs/debdeploy] - https://gerrit.wikimedia.org/r/375770 (owner: Muehlenhoff)
[12:08:11] (CR) Muehlenhoff: [C: 2] Show the amount of updated/non-updated hosts [debs/debdeploy] - https://gerrit.wikimedia.org/r/375770 (owner: Muehlenhoff)
[12:08:20] Operations, ops-eqiad, DBA, Patch-For-Review: Decommission db1037 - https://phabricator.wikimedia.org/T174902#3577181 (Marostegui)
[12:08:27] (CR) Muehlenhoff: [C: 2] Show an error if passing an invalid Cumin alias [debs/debdeploy] - https://gerrit.wikimedia.org/r/375771 (owner: Muehlenhoff)
[12:09:11] (PS1) Muehlenhoff: Update TODO [debs/debdeploy] - https://gerrit.wikimedia.org/r/375786
[12:10:08] (CR) Muehlenhoff: [C: 2] Update TODO [debs/debdeploy] - https://gerrit.wikimedia.org/r/375786 (owner: Muehlenhoff)
[12:10:26] !log rolling restart of zookeeper on conf100[123] for jvm security updates
[12:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:47] (PS2) ArielGlenn: use cleaner method of getting config settings for rdf cat dumps cron job [puppet] - https://gerrit.wikimedia.org/r/375783
[12:12:50] (CR) Marostegui: [C: 2] mariadb: Remove db1037 for decomm [puppet] - https://gerrit.wikimedia.org/r/375779 (https://phabricator.wikimedia.org/T174902) (owner: Marostegui)
[12:12:55] (PS3) Marostegui: mariadb: Remove db1037 for decomm [puppet] - https://gerrit.wikimedia.org/r/375779 (https://phabricator.wikimedia.org/T174902)
[12:15:35] (PS3) ArielGlenn: use cleaner method of getting config settings for rdf cat dumps cron job [puppet] - https://gerrit.wikimedia.org/r/375783
[12:16:22] (CR) ArielGlenn: [C: 2] use cleaner method of getting config settings for rdf cat dumps cron job [puppet] - https://gerrit.wikimedia.org/r/375783 (owner: ArielGlenn)
[12:16:25] Operations, ops-esams, netops: Setup esams atlas anchor - https://phabricator.wikimedia.org/T174637#3568685 (faidon) @mark assigned asset tag `WMF4203` to this device. The image has also been generated (for AS43821) and can be found on install1002.
[12:16:42] (PS8) Phedenskog: Make values stackable [puppet] - https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902)
[12:17:53] Operations, ops-eqiad, DBA, Patch-For-Review: Decommission db1037 - https://phabricator.wikimedia.org/T174902#3577231 (Marostegui)
[12:18:15] !log Stop MySQL on db1037 as it is going to be decommissioned - T174902
[12:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:27] T174902: Decommission db1037 - https://phabricator.wikimedia.org/T174902
[12:19:38] (CR) Phedenskog: Make values stackable (1 comment) [puppet] - https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: Phedenskog)
[12:22:05] (PS1) Marostegui: s6.hosts: Remove db1037 for decommission [software] - https://gerrit.wikimedia.org/r/375790 (https://phabricator.wikimedia.org/T174902)
[12:22:38] Operations, Discovery, Discovery-Analysis, Maps, and 3 others: What is a reasonable per-IP ratelimit for maps - https://phabricator.wikimedia.org/T169175#3389649 (Gehel) Rate limiting has been enabled by @ema. Everything is looking good so far. This task can be closed and we'll open up follow up...
[12:23:58] (CR) Marostegui: [C: 2] s6.hosts: Remove db1037 for decommission [software] - https://gerrit.wikimedia.org/r/375790 (https://phabricator.wikimedia.org/T174902) (owner: Marostegui)
[12:24:22] Operations, ops-eqiad, DBA, Patch-For-Review: Decommission db1037 - https://phabricator.wikimedia.org/T174902#3577256 (Marostegui) a:Cmjohnson MySQL is stopped.
This host is now ready for the remaining DC Ops steps to be completed [12:26:29] !log uploaded debdeploy 0.0.99-1+trusty/trusty-wikimedia to apt.wikimedia.org [12:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:26] (03Merged) 10jenkins-bot: s6.hosts: Remove db1037 for decommission [software] - 10https://gerrit.wikimedia.org/r/375790 (https://phabricator.wikimedia.org/T174902) (owner: 10Marostegui) [12:47:18] (03PS1) 10ArielGlenn: convert "other" dump crons to use script to grabconfig settings [puppet] - 10https://gerrit.wikimedia.org/r/375791 (https://phabricator.wikimedia.org/T174929) [12:47:20] (03PS1) 10Muehlenhoff: Fix OS detection [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/375792 [12:49:51] (03CR) 10Gilles: Make values stackable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog) [12:51:32] (03CR) 10Gilles: Make values stackable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: 10Phedenskog) [12:52:44] (03PS1) 10Muehlenhoff: Install debdeploy-client [puppet] - 10https://gerrit.wikimedia.org/r/375793 (https://phabricator.wikimedia.org/T164817) [13:03:13] 10Operations, 10Traffic: Recurrent 'mailbox lag' critical alerts and 500s - https://phabricator.wikimedia.org/T174932#3577349 (10fgiunchedi) [13:04:41] (03PS3) 10Filippo Giunchedi: cassandra: add execute permission to jbod mountpoint [puppet] - 10https://gerrit.wikimedia.org/r/375775 (https://phabricator.wikimedia.org/T169939) [13:06:02] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/375793 (https://phabricator.wikimedia.org/T164817) (owner: 10Muehlenhoff) [13:06:16] (03CR) 10Filippo Giunchedi: [C: 032] cassandra: add execute permission to jbod mountpoint [puppet] - 10https://gerrit.wikimedia.org/r/375775 (https://phabricator.wikimedia.org/T169939) (owner: 10Filippo Giunchedi) [13:06:26] 
10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3577404 (10elukey) So we solved the issue with partman and I was able to install the os on kafka-jumbo100[12], but failed to PXE boot... [13:08:19] (03CR) 10Gilles: "Do you mean that you want the filter-logback.conf addition in its own patch?" [puppet] - 10https://gerrit.wikimedia.org/r/365619 (https://phabricator.wikimedia.org/T150734) (owner: 10Gilles) [13:12:00] (03PS2) 10Muehlenhoff: Put ganglia behind LDAP authentication [puppet] - 10https://gerrit.wikimedia.org/r/374320 [13:13:43] (03CR) 10Muehlenhoff: [C: 032] Put ganglia behind LDAP authentication [puppet] - 10https://gerrit.wikimedia.org/r/374320 (owner: 10Muehlenhoff) [13:14:17] (03PS1) 10Gehel: logstash - stop deploying custom elasticsearch plugins [puppet] - 10https://gerrit.wikimedia.org/r/375795 [13:14:42] (03CR) 10jerkins-bot: [V: 04-1] logstash - stop deploying custom elasticsearch plugins [puppet] - 10https://gerrit.wikimedia.org/r/375795 (owner: 10Gehel) [13:15:36] (03PS2) 10Gehel: logstash - stop deploying custom elasticsearch plugins [puppet] - 10https://gerrit.wikimedia.org/r/375795 [13:19:43] (03PS3) 10Filippo Giunchedi: cassandra: reprovision restbase2003 with cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/375774 (https://phabricator.wikimedia.org/T169939) [13:21:50] (03CR) 10Filippo Giunchedi: "> Do you mean that you want the filter-logback.conf addition in its" [puppet] - 10https://gerrit.wikimedia.org/r/365619 (https://phabricator.wikimedia.org/T150734) (owner: 10Gilles) [13:21:53] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch, 10Wikimedia-Logstash: Do not deploy Cirrus elasticsearch plugins on logstash cluster - https://phabricator.wikimedia.org/T174933#3577425 (10Gehel) [13:22:02] (03PS3) 10Gehel: logstash - stop deploying custom elasticsearch plugins [puppet] - 
10https://gerrit.wikimedia.org/r/375795 (https://phabricator.wikimedia.org/T174933) [13:22:29] !log restart elasticsearch on logstash1006 to test plugin removal - T174933 [13:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:43] T174933: Do not deploy Cirrus elasticsearch plugins on logstash cluster - https://phabricator.wikimedia.org/T174933 [13:24:22] (03CR) 10DCausse: [C: 031] logstash - stop deploying custom elasticsearch plugins [puppet] - 10https://gerrit.wikimedia.org/r/375795 (https://phabricator.wikimedia.org/T174933) (owner: 10Gehel) [13:27:03] (03PS1) 10Muehlenhoff: Include ::passwords::ldap::production [puppet] - 10https://gerrit.wikimedia.org/r/375797 [13:29:21] (03PS4) 10Gehel: logstash - stop deploying custom elasticsearch plugins [puppet] - 10https://gerrit.wikimedia.org/r/375795 (https://phabricator.wikimedia.org/T174933) [13:29:32] (03CR) 10Muehlenhoff: [C: 032] Include ::passwords::ldap::production [puppet] - 10https://gerrit.wikimedia.org/r/375797 (owner: 10Muehlenhoff) [13:30:25] (03PS5) 10Gehel: logstash - stop deploying custom elasticsearch plugins [puppet] - 10https://gerrit.wikimedia.org/r/375795 (https://phabricator.wikimedia.org/T174933) [13:31:02] (03CR) 10Gehel: [C: 032] logstash - stop deploying custom elasticsearch plugins [puppet] - 10https://gerrit.wikimedia.org/r/375795 (https://phabricator.wikimedia.org/T174933) (owner: 10Gehel) [13:31:34] (03PS1) 10Giuseppe Lavagetto: jobrunner: add nginx service [puppet] - 10https://gerrit.wikimedia.org/r/375800 (https://phabricator.wikimedia.org/T174599) [13:31:36] (03PS1) 10Giuseppe Lavagetto: jobrunner: add LVS service configuration [puppet] - 10https://gerrit.wikimedia.org/r/375801 (https://phabricator.wikimedia.org/T174599) [13:31:50] (03PS1) 10Ladsgroup: Add Croatian language assets [puppet] - 10https://gerrit.wikimedia.org/r/375802 (https://phabricator.wikimedia.org/T172046) [13:34:30] !log delete leftover grafana-dashboards index on elasticsearch 
logstash cluster [13:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:08] (03CR) 10Filippo Giunchedi: "LGTM, minor nit" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/375781 (https://phabricator.wikimedia.org/T161719) (owner: 10Gilles) [13:36:12] 10Operations, 10DBA, 10Traffic: Substantive HTTP and mediawiki/database traffic coming from a single ip - https://phabricator.wikimedia.org/T166695#3577497 (10Marostegui) 05stalled>03Resolved Closing this as this has not happened again in months. If it happens again, let's reopen and follow up [13:37:02] !log rolling restart of elasticsearch/logstash to cleanup unused plugins - T174933 [13:37:04] (03CR) 10Filippo Giunchedi: [C: 04-1] "Wouldn't work in production, disabling metrics: https://puppet-compiler.wmflabs.org/compiler02/7698/thumbor1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/375772 (https://phabricator.wikimedia.org/T174746) (owner: 10Gilles) [13:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:16] T174933: Do not deploy Cirrus elasticsearch plugins on logstash cluster - https://phabricator.wikimedia.org/T174933 [13:41:13] (03CR) 10Filippo Giunchedi: [C: 031] Icinga: raid handler, catch zlib exceptions [puppet] - 10https://gerrit.wikimedia.org/r/375755 (https://phabricator.wikimedia.org/T174854) (owner: 10Volans) [13:42:55] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, just a nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/375756 (https://phabricator.wikimedia.org/T174854) (owner: 10Volans) [13:43:58] 10Operations, 10Discovery, 10Elasticsearch, 10Wikimedia-Logstash, and 2 others: Do not deploy Cirrus elasticsearch plugins on logstash cluster - https://phabricator.wikimedia.org/T174933#3577502 (10Gehel) p:05Triage>03High [13:44:43] 10Operations, 10Discovery, 10Elasticsearch, 10Wikimedia-Logstash, and 2 others: Do not deploy Cirrus elasticsearch plugins on logstash cluster - 
https://phabricator.wikimedia.org/T174933#3577425 (10Gehel) Cleanup of old plugins done, elasticsearch has been restarted, no errors are seen in the logs, we seem... [13:45:13] (03CR) 10Filippo Giunchedi: [C: 04-1] Only uses lua in Thumbor Nginx config if "extras" variant (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/375772 (https://phabricator.wikimedia.org/T174746) (owner: 10Gilles) [13:45:58] (03PS2) 10Volans: Icinga: raid handler, catch zlib exceptions [puppet] - 10https://gerrit.wikimedia.org/r/375755 (https://phabricator.wikimedia.org/T174854) [13:46:00] (03PS2) 10Volans: Raid: optimize get raid status for HP controllers [puppet] - 10https://gerrit.wikimedia.org/r/375756 (https://phabricator.wikimedia.org/T174854) [13:46:19] (03CR) 10Volans: "done" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/375756 (https://phabricator.wikimedia.org/T174854) (owner: 10Volans) [13:46:57] (03CR) 10Volans: [C: 032] Icinga: raid handler, catch zlib exceptions [puppet] - 10https://gerrit.wikimedia.org/r/375755 (https://phabricator.wikimedia.org/T174854) (owner: 10Volans) [13:47:16] (03CR) 10Volans: [C: 032] Raid: optimize get raid status for HP controllers [puppet] - 10https://gerrit.wikimedia.org/r/375756 (https://phabricator.wikimedia.org/T174854) (owner: 10Volans) [13:48:35] (03PS2) 10Giuseppe Lavagetto: jobrunner: add nginx service [puppet] - 10https://gerrit.wikimedia.org/r/375800 (https://phabricator.wikimedia.org/T174599) [13:48:37] (03PS2) 10Giuseppe Lavagetto: jobrunner: add LVS service configuration [puppet] - 10https://gerrit.wikimedia.org/r/375801 (https://phabricator.wikimedia.org/T174599) [13:54:12] (03PS3) 10Elukey: stat1003: remove puppet configuration as part of decom [puppet] - 10https://gerrit.wikimedia.org/r/374332 (https://phabricator.wikimedia.org/T152712) [13:54:48] (03PS1) 10Muehlenhoff: Extend Ganglia Apache config for LDAP authentication [puppet] - 10https://gerrit.wikimedia.org/r/375805 [14:01:41] (03PS1) 10Giuseppe Lavagetto: 
Add profile::openstack::main::rabbit_monitor_pass [labs/private] - 10https://gerrit.wikimedia.org/r/375807 [14:02:01] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Add profile::openstack::main::rabbit_monitor_pass [labs/private] - 10https://gerrit.wikimedia.org/r/375807 (owner: 10Giuseppe Lavagetto) [14:03:22] (03PS1) 10Giuseppe Lavagetto: Add secrets for jobrunner.svc [labs/private] - 10https://gerrit.wikimedia.org/r/375809 [14:03:47] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Add secrets for jobrunner.svc [labs/private] - 10https://gerrit.wikimedia.org/r/375809 (owner: 10Giuseppe Lavagetto) [14:04:28] (03PS2) 10Muehlenhoff: Extend Ganglia Apache config for LDAP authentication [puppet] - 10https://gerrit.wikimedia.org/r/375805 [14:04:41] (03PS2) 10Addshore: WIP DNM Add ::statistics::wmde::wikidata_concepts [puppet] - 10https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258) [14:05:01] (03CR) 10jerkins-bot: [V: 04-1] WIP DNM Add ::statistics::wmde::wikidata_concepts [puppet] - 10https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258) (owner: 10Addshore) [14:06:50] (03PS3) 10Muehlenhoff: Extend Ganglia Apache config for LDAP authentication [puppet] - 10https://gerrit.wikimedia.org/r/375805 [14:06:58] (03PS1) 10Addshore: Use the same name for group an user in wmde stats [puppet] - 10https://gerrit.wikimedia.org/r/375811 [14:07:26] (03CR) 10Elukey: [C: 031] Extend Ganglia Apache config for LDAP authentication [puppet] - 10https://gerrit.wikimedia.org/r/375805 (owner: 10Muehlenhoff) [14:08:37] (03CR) 10Giuseppe Lavagetto: [C: 032] Add entries for the jobrunner LVS service [dns] - 10https://gerrit.wikimedia.org/r/375747 (https://phabricator.wikimedia.org/T174599) (owner: 10Giuseppe Lavagetto) [14:08:49] (03PS2) 10ArielGlenn: convert "other" dump crons to use script to grabconfig settings [puppet] - 10https://gerrit.wikimedia.org/r/375791 (https://phabricator.wikimedia.org/T174929) [14:09:26] (03PS2) 10Addshore: Use 
the same name for group an user in wmde stats [puppet] - 10https://gerrit.wikimedia.org/r/375811 [14:09:35] (03PS3) 10Addshore: WIP DNM Add ::statistics::wmde::wikidata_concepts [puppet] - 10https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258) [14:09:45] (03CR) 10Giuseppe Lavagetto: [C: 032] jobrunner: add nginx service [puppet] - 10https://gerrit.wikimedia.org/r/375800 (https://phabricator.wikimedia.org/T174599) (owner: 10Giuseppe Lavagetto) [14:09:55] (03CR) 10jerkins-bot: [V: 04-1] WIP DNM Add ::statistics::wmde::wikidata_concepts [puppet] - 10https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258) (owner: 10Addshore) [14:10:39] (03PS1) 10Gehel: elasticsearch - deploy plugins with debian package instead of trebuchet [puppet] - 10https://gerrit.wikimedia.org/r/375812 (https://phabricator.wikimedia.org/T158560) [14:11:13] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch - deploy plugins with debian package instead of trebuchet [puppet] - 10https://gerrit.wikimedia.org/r/375812 (https://phabricator.wikimedia.org/T158560) (owner: 10Gehel) [14:13:06] (03CR) 10Filippo Giunchedi: "> @filippo, this role is applied to andrewclient.puppet.wmflabs.org" [puppet] - 10https://gerrit.wikimedia.org/r/375452 (owner: 10Andrew Bogott) [14:13:11] (03PS4) 10Addshore: WIP DNM Add ::statistics::wmde::wikidata_concepts [puppet] - 10https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258) [14:13:32] (03CR) 10jerkins-bot: [V: 04-1] WIP DNM Add ::statistics::wmde::wikidata_concepts [puppet] - 10https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258) (owner: 10Addshore) [14:14:31] (03PS2) 10Gehel: elasticsearch - deploy plugins with debian package instead of trebuchet [puppet] - 10https://gerrit.wikimedia.org/r/375812 (https://phabricator.wikimedia.org/T158560) [14:14:59] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch - deploy plugins with debian package instead of trebuchet [puppet] - 
10https://gerrit.wikimedia.org/r/375812 (https://phabricator.wikimedia.org/T158560) (owner: 10Gehel) [14:15:11] (03PS5) 10Addshore: WIP DNM Add ::statistics::wmde::wikidata_concepts [puppet] - 10https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258) [14:15:26] (03PS3) 10Gehel: elasticsearch - deploy plugins with debian package instead of trebuchet [puppet] - 10https://gerrit.wikimedia.org/r/375812 (https://phabricator.wikimedia.org/T158560) [14:17:45] (03PS4) 10Muehlenhoff: Extend Ganglia Apache config for LDAP authentication [puppet] - 10https://gerrit.wikimedia.org/r/375805 [14:22:31] (03CR) 10DCausse: [C: 031] elasticsearch - deploy plugins with debian package instead of trebuchet [puppet] - 10https://gerrit.wikimedia.org/r/375812 (https://phabricator.wikimedia.org/T158560) (owner: 10Gehel) [14:24:35] (03CR) 10Muehlenhoff: [C: 032] Extend Ganglia Apache config for LDAP authentication [puppet] - 10https://gerrit.wikimedia.org/r/375805 (owner: 10Muehlenhoff) [14:25:27] 10Operations, 10Analytics, 10EventBus, 10User-Elukey: Eventbus does not handle gracefully changes in DNS recursors - https://phabricator.wikimedia.org/T171048#3577723 (10elukey) a:05elukey>03None [14:27:35] !log enabled LDAP authentication for ganglia.wikimedia.org [14:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:29] (03PS3) 10ArielGlenn: convert "other" dump crons to use script to grabconfig settings [puppet] - 10https://gerrit.wikimedia.org/r/375791 (https://phabricator.wikimedia.org/T174929) [14:32:11] (03CR) 10Muehlenhoff: [C: 032] Fix OS detection [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/375792 (owner: 10Muehlenhoff) [14:35:02] (03CR) 10Andrew Bogott: "> Which dashboard? What's the error you get?" 
[puppet] - 10https://gerrit.wikimedia.org/r/375452 (owner: 10Andrew Bogott) [14:37:44] RECOVERY - Check systemd state on chlorine is OK: OK - running: The system is fully operational [14:40:44] PROBLEM - Check systemd state on chlorine is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:44:06] (03PS4) 10Filippo Giunchedi: cassandra: reprovision restbase2003 with cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/375774 (https://phabricator.wikimedia.org/T169939) [14:44:34] ACKNOWLEDGEMENT - Check systemd state on chlorine is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Giuseppe Lavagetto Alex is working on the stretch conversion [14:44:34] ACKNOWLEDGEMENT - puppet last run on chlorine is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[kube-apiserver] Giuseppe Lavagetto Alex is working on the stretch conversion [14:44:34] ACKNOWLEDGEMENT - Check systemd state on kubernetes1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Giuseppe Lavagetto Alex is working on the stretch conversion [14:44:34] ACKNOWLEDGEMENT - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[cni] Giuseppe Lavagetto Alex is working on the stretch conversion [14:46:46] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/7704/" [puppet] - 10https://gerrit.wikimedia.org/r/375774 (https://phabricator.wikimedia.org/T169939) (owner: 10Filippo Giunchedi) [14:47:19] (03CR) 10Filippo Giunchedi: [C: 032] cassandra: reprovision restbase2003 with cassandra 3 [puppet] - 10https://gerrit.wikimedia.org/r/375774 (https://phabricator.wikimedia.org/T169939) (owner: 10Filippo Giunchedi) [14:48:09] (03PS1) 10Muehlenhoff: Mark as releasedx [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/375818 [14:48:20] (03PS2) 10Muehlenhoff: Mark as released [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/375818 [14:51:03] !log reimage restbase2003 for use with cassandra 3 / jbod - T169939 [14:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:15] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939 [14:52:57] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3577868 (10Andrew) It's down again, icinga says since 2017-08-30 16:14:18 [14:54:00] (03CR) 10Muehlenhoff: [C: 032] Mark as released [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/375818 (owner: 10Muehlenhoff) [15:15:29] 10Operations, 10monitoring, 10netops, 10User-fgiunchedi: Grafana dashboards for librenms graphite data - https://phabricator.wikimedia.org/T171823#3577956 (10fgiunchedi) I checked librenms' readings for current/voltage and indeed match what's being pushed to graphite. The default aggregation in graphite we... 
[15:31:58] (PS1) Andrew Bogott: rabbitmq: add a giant default config [puppet] - https://gerrit.wikimedia.org/r/375822 (https://phabricator.wikimedia.org/T170492) [15:36:21] (PS2) Andrew Bogott: rabbitmq: add a giant default config [puppet] - https://gerrit.wikimedia.org/r/375822 (https://phabricator.wikimedia.org/T170492) [15:37:36] RECOVERY - Check systemd state on chlorine is OK: OK - running: The system is fully operational [15:40:31] (PS3) Giuseppe Lavagetto: jobrunner: add LVS service configuration [puppet] - https://gerrit.wikimedia.org/r/375801 (https://phabricator.wikimedia.org/T174599) [15:40:53] PROBLEM - Check systemd state on chlorine is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:42:57] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,service=nginx [15:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:44] PROBLEM - puppet last run on restbase2003 is CRITICAL: Return code of 255 is out of bounds [15:43:53] PROBLEM - cassandra-a CQL 10.192.32.134:9042 on restbase2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:44:44] PROBLEM - Check size of conntrack table on restbase2003 is CRITICAL: Return code of 255 is out of bounds [15:44:44] PROBLEM - salt-minion processes on restbase2003 is CRITICAL: Return code of 255 is out of bounds [15:44:59] (PS1) Andrew Bogott: rabbitmq: increase vm_memory_high_watermark_paging_ratio [puppet] - https://gerrit.wikimedia.org/r/375823 (https://phabricator.wikimedia.org/T170492) [15:45:09] (PS7) Filippo Giunchedi: Cassandra: Do not include the main DNS in the list of seeds [puppet] - https://gerrit.wikimedia.org/r/370554 (https://phabricator.wikimedia.org/T172610) (owner: Mobrovac) [15:50:09] (CR) Filippo Giunchedi: [C: 2] Cassandra: Do not include the main DNS in the list of seeds [puppet] - https://gerrit.wikimedia.org/r/370554 
(https://phabricator.wikimedia.org/T172610) (owner: Mobrovac) [15:50:21] (CR) Giuseppe Lavagetto: [C: 2] jobrunner: add LVS service configuration [puppet] - https://gerrit.wikimedia.org/r/375801 (https://phabricator.wikimedia.org/T174599) (owner: Giuseppe Lavagetto) [15:50:40] (PS4) Giuseppe Lavagetto: jobrunner: add LVS service configuration [puppet] - https://gerrit.wikimedia.org/r/375801 (https://phabricator.wikimedia.org/T174599) [15:53:43] (PS1) Mobrovac: RESTBase: Remove restbase10(0[89]|10) from the list of seeds [puppet] - https://gerrit.wikimedia.org/r/375825 (https://phabricator.wikimedia.org/T169939) [15:56:13] (CR) Mobrovac: "PCC looking good - https://puppet-compiler.wmflabs.org/compiler02/7708/" [puppet] - https://gerrit.wikimedia.org/r/375825 (https://phabricator.wikimedia.org/T169939) (owner: Mobrovac) [15:56:27] godog: ^^^ [15:58:35] mobrovac: ditto for codfw I suppose [15:59:00] godog: codfw already excludes 200[135] so no changes are needed there [16:01:35] mobrovac: 2005 is still in restbase::seeds afaics in hieradata/role/codfw/restbase/production.yaml [16:01:52] * mobrovac looking [16:02:46] it's in restbase::hosts which is for rate limiting, but yeah, let's be thorough and remove that too [16:02:46] * mobrovac amending [16:03:44] (PS2) Mobrovac: RESTBase: Remove restbase10(0[89]|10) from the list of seeds [puppet] - https://gerrit.wikimedia.org/r/375825 (https://phabricator.wikimedia.org/T169939) [16:03:49] godog: {{done}} ^& [16:03:55] Operations, Cloud-Services, Cloud-VPS, Patch-For-Review, Wikimedia-Incident: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#3578005 (Andrew) This no longer happens as a matter of course, but anytime designate-sink locks up we leak things fo... 
[16:05:20] (CR) Filippo Giunchedi: [C: 2] RESTBase: Remove restbase10(0[89]|10) from the list of seeds [puppet] - https://gerrit.wikimedia.org/r/375825 (https://phabricator.wikimedia.org/T169939) (owner: Mobrovac) [16:05:51] mobrovac: neat, {{done}} [16:06:17] cool, thnx godog, i'll run puppet and restart RB afterwards [16:06:22] * elukey off! [16:07:20] <_joe_> !log rolling restart of pybal in codfw/eqiad low-traffic pools for picking up the new jobrunner LVS endpoint [16:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:03] RECOVERY - Check systemd state on chlorine is OK: OK - running: The system is fully operational [16:08:44] _joe_: \o/ [16:08:57] <_joe_> don't be too happy [16:08:59] <_joe_> :P [16:09:06] Operations, media-storage: swift-recon-cron on ms-be203[34]: [Errno 17] File exists: '/var/lock/swift-recon-object-cron' - https://phabricator.wikimedia.org/T174959#3578014 (ema) [16:09:47] Operations, Continuous-Integration-Config, Continuous-Integration-Infrastructure, Patch-For-Review, Zuul: Zuul has started failing on some repo's in gerrit.wikimedia.org - https://phabricator.wikimedia.org/T153877#2893745 (Addshore) Just ran into this again while trying to do "apt-get install... [16:10:13] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - jobrunner_443 - Could not depool server mw2247.codfw.wmnet because of too many down! 
[16:10:22] !log restbase depooled restbase10(0[89]|10) and restbase2005 for T169939 [16:10:23] PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw2159.codfw.wmnet, mw2249.codfw.wmnet, mw2247.codfw.wmnet]) [16:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:35] T169939: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939 [16:11:00] <_joe_> that's me [16:11:03] <_joe_> I forgot to add a ferm rule [16:11:04] PROBLEM - Check systemd state on chlorine is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:12:14] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - jobrunner_443 - Could not depool server mw1300.eqiad.wmnet because of too many down! [16:13:40] (PS1) Giuseppe Lavagetto: profile::mediawiki::jobrunner_tls: add ferm rule [puppet] - https://gerrit.wikimedia.org/r/375828 [16:13:53] (CR) Giuseppe Lavagetto: [C: 2] profile::mediawiki::jobrunner_tls: add ferm rule [puppet] - https://gerrit.wikimedia.org/r/375828 (owner: Giuseppe Lavagetto) [16:16:24] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1306.eqiad.wmnet, mw1164.eqiad.wmnet, mw1300.eqiad.wmnet]) [16:16:43] <_joe_> ema: ^^ that's wrong [16:17:23] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [16:18:14] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [16:20:28] RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal [16:21:17] _joe_: the eqiad one is wrong while the codfw one is right? 
[16:21:28] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal [16:21:29] <_joe_> those servers were known to pybal [16:21:35] <_joe_> just pooled while being down [16:21:41] <_joe_> because of the depool threshold [16:22:28] RECOVERY - salt-minion processes on restbase2003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:22:29] RECOVERY - Check size of conntrack table on restbase2003 is OK: OK: nf_conntrack is 0 % full [16:23:14] _joe_: oh, yeah, the check looks for enabled/up/pooled hosts [16:23:24] !log mobrovac@tin Started deploy [restbase/deploy@8c9c436]: (no justification provided) [16:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:55] _joe_: perhaps we should change the message to "hosts in ipvs but unknown/down in PyBal"? [16:26:14] <_joe_> well we already have another alert for that specifically [16:26:26] <_joe_> PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - jobrunner_443 - Could not depool server mw2247.codfw.wmnet because of too [16:26:29] <_joe_> many down! 
[16:27:12] (PS1) Filippo Giunchedi: prometheus: fix extra restbase/cassandra metrics rule [puppet] - https://gerrit.wikimedia.org/r/375832 [16:27:40] (CR) Filippo Giunchedi: [C: 2] prometheus: fix extra restbase/cassandra metrics rule [puppet] - https://gerrit.wikimedia.org/r/375832 (owner: Filippo Giunchedi) [16:29:47] !log mobrovac@tin Finished deploy [restbase/deploy@8c9c436]: (no justification provided) (duration: 06m 23s) [16:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:49] Operations, Traffic, Wikimedia-Logstash: Varnish does not vary elasticsearch query by request body - https://phabricator.wikimedia.org/T174960#3578104 (mobrovac) [16:46:53] RECOVERY - puppet last run on restbase2003 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [16:47:24] (PS4) Gehel: wdqs - logging pattern to conform to the logback MDCInsertingServletFilter [puppet] - https://gerrit.wikimedia.org/r/374513 (https://phabricator.wikimedia.org/T172710) [16:57:13] (PS13) Paladox: gerrit: DO NOT MERGE [software/gerrit] - https://gerrit.wikimedia.org/r/363738 [16:57:15] (PS12) Paladox: Gerrit: Upgrading gerrit to 2.14.3-pre (DO NOT MERGE) [software/gerrit] - https://gerrit.wikimedia.org/r/363734 [16:57:27] (PS13) Paladox: Gerrit: Upgrading gerrit to 2.14.4-pre (DO NOT MERGE) [software/gerrit] - https://gerrit.wikimedia.org/r/363734 [17:00:04] gehel: Respected human, time to deploy Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170904T1700). Please do the needful. [17:00:04] Smalyshev and gehel: A patch you scheduled for Wikidata Query Service weekly deploy is about to be deployed. Please be available during the process. 
[17:13:18] (CR) Gehel: [C: 2] wdqs - logging pattern to conform to the logback MDCInsertingServletFilter [puppet] - https://gerrit.wikimedia.org/r/374513 (https://phabricator.wikimedia.org/T172710) (owner: Gehel) [17:17:07] !log gehel@tin Started deploy [wdqs/wdqs@1caaa30]: weekly WDQS deployment [17:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:54] !log gehel@tin Finished deploy [wdqs/wdqs@1caaa30]: weekly WDQS deployment (duration: 02m 46s) [17:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:20] SMalyshev: wdqs deployment done, tests are green [17:21:32] cool, thanks! [17:50:22] !log uploaded debdeploy 0.0.99-2 to apt.wikimedia.org [17:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:43] (CR) Krinkle: Make values stackable (1 comment) [puppet] - https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: Phedenskog) [18:05:42] PROBLEM - puppet last run on cp4024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:06:22] PROBLEM - Check systemd state on cp4024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[18:17:46] (CR) Krinkle: Make values stackable (2 comments) [puppet] - https://gerrit.wikimedia.org/r/375345 (https://phabricator.wikimedia.org/T104902) (owner: Phedenskog) [19:04:55] (PS2) Gilles: Add STL support to Thumbor, behind flag [puppet] - https://gerrit.wikimedia.org/r/375781 (https://phabricator.wikimedia.org/T161719) [19:05:09] (CR) Gilles: Add STL support to Thumbor, behind flag (2 comments) [puppet] - https://gerrit.wikimedia.org/r/375781 (https://phabricator.wikimedia.org/T161719) (owner: Gilles) [19:08:51] (PS4) Gilles: Add logback filter for Thumbor [puppet] - https://gerrit.wikimedia.org/r/365619 (https://phabricator.wikimedia.org/T150734) [19:08:52] (PS1) Gilles: Send Thumbor error log to logstash [puppet] - https://gerrit.wikimedia.org/r/375859 (https://phabricator.wikimedia.org/T150734) [19:09:49] (PS1) Merlijn van Deen: [labs] Rename wmflabs-* files to wmcs-* [puppet] - https://gerrit.wikimedia.org/r/375860 [19:10:23] (CR) jerkins-bot: [V: -1] [labs] Rename wmflabs-* files to wmcs-* [puppet] - https://gerrit.wikimedia.org/r/375860 (owner: Merlijn van Deen) [19:16:43] (PS2) Gilles: Only uses lua in Thumbor Nginx config if "extras" variant [puppet] - https://gerrit.wikimedia.org/r/375772 (https://phabricator.wikimedia.org/T174746) [19:17:12] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2053402 [19:17:30] (PS3) Gilles: Thumbor: only use lua in nginx config if "extras" variant [puppet] - https://gerrit.wikimedia.org/r/375772 (https://phabricator.wikimedia.org/T174746) [19:32:47] (PS2) Merlijn van Deen: [labs] Rename wmflabs-* files to wmcs-* [puppet] - https://gerrit.wikimedia.org/r/375860 (https://phabricator.wikimedia.org/T174082) [19:33:09] (CR) Merlijn van Deen: [C: -1] "Let me -1 this myself, as it's not entirely clear whether we should do this."
[puppet] - https://gerrit.wikimedia.org/r/375860 (https://phabricator.wikimedia.org/T174082) (owner: Merlijn van Deen) [19:34:18] (PS4) Gilles: Thumbor: only use lua in nginx config if "extras" variant [puppet] - https://gerrit.wikimedia.org/r/375772 (https://phabricator.wikimedia.org/T174746) [19:52:11] (CR) GoranSMilovanovic: [C: 1] WIP DNM Add ::statistics::wmde::wikidata_concepts [puppet] - https://gerrit.wikimedia.org/r/369902 (https://phabricator.wikimedia.org/T171258) (owner: Addshore) [20:26:03] Operations, MediaWiki-Platform-Team, Availability (Multiple-active-datacenters), Performance-Team (Radar), and 4 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3578406 (Krinkle) [21:00:27] (Draft1) Paladox: Gerrit: Set base url for commitlink [puppet] - https://gerrit.wikimedia.org/r/375922 [21:00:31] (PS2) Paladox: Gerrit: Set base url for commitlink [puppet] - https://gerrit.wikimedia.org/r/375922 [21:02:33] (PS3) Paladox: Gerrit: Set base url for commitlink [puppet] - https://gerrit.wikimedia.org/r/375922 [21:05:13] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 2054028 [21:07:13] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 228 [21:08:49] (CR) Paladox: [C: 1] "Tested locally and works in both the old ui and new ui :)." [puppet] - https://gerrit.wikimedia.org/r/375922 (owner: Paladox) [21:35:30] Operations, Continuous-Integration-Config, Continuous-Integration-Infrastructure, Patch-For-Review, Zuul: Zuul has started failing on some repo's in gerrit.wikimedia.org - https://phabricator.wikimedia.org/T153877#3578497 (hashar) Yup you need python-pbr 0.8 until I manage to find the time to...
[21:36:29] (CR) Krinkle: [C: 1] "SGTM" [puppet] - https://gerrit.wikimedia.org/r/262068 (https://phabricator.wikimedia.org/T71581) (owner: Reedy) [22:07:35] !log starting script to reparse all pages in parsoid for Linter (python2 parsoid-reparse.py http://parsoid.discovery.wmnet:8000 --sitematrix --linter-only --skip-closed https://aa.wikipedia.org/w/api.php) - T161556 [22:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:50] T161556: Implement a way to have linter reprocess all pages - https://phabricator.wikimedia.org/T161556 [22:37:32] RECOVERY - Check systemd state on chlorine is OK: OK - running: The system is fully operational [22:40:32] PROBLEM - Check systemd state on chlorine is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:22:05] (CR) Krinkle: "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/375106 (https://phabricator.wikimedia.org/T110903) (owner: Krinkle) [23:32:21] !log Set e-mail to Hexabot bot account to allow account recovery (T174973) [23:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:36] T174973: Recover Hexabot account - https://phabricator.wikimedia.org/T174973 [23:35:22] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 697