[00:29:28] <icinga-wm>	 PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3035_v6, cp3038_v6
[00:31:31] <icinga-wm>	 RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK
[01:19:18] <icinga-wm>	 PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 21 not-conn: cp3017_v6, cp4012_v6, cp4020_v6
[01:21:10] <icinga-wm>	 RECOVERY - IPsec on cp1060 is OK: Strongswan OK - 24 ESP OK
[01:31:31] <Krinkle>	 !log mwscript deleteEqualMessages.php --wiki sqwiki
[01:31:39] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:34:39] <icinga-wm>	 PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 59 connecting: (unnamed) not-conn: cp3045_v6
[01:35:50] <icinga-wm>	 PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (103515s 100000s)
[01:38:39] <icinga-wm>	 RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK
[01:46:39] <icinga-wm>	 PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp2024_v6
[01:50:39] <icinga-wm>	 RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK
[02:15:08] <icinga-wm>	 PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp2024_v6
[02:17:08] <icinga-wm>	 RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK
[02:26:05] <logmsgbot>	 !log l10nupdate@tin Synchronized php-1.26wmf22/cache/l10n: l10nupdate for 1.26wmf22 (duration: 06m 59s)
[02:26:17] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:29:49] <logmsgbot>	 !log l10nupdate@tin LocalisationUpdate completed (1.26wmf22) at 2015-09-14 02:29:48+00:00
[02:29:57] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:38:59] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[02:47:00] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[02:48:09] <icinga-wm>	 PROBLEM - IPsec on cp1059 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp4012_v6
[02:50:10] <icinga-wm>	 RECOVERY - IPsec on cp1059 is OK: Strongswan OK - 24 ESP OK
[03:02:06] <grrrit-wm>	 (PS1) Tim Landscheidt: Tools: Migrate from labsdebrepo to aptly [puppet] - https://gerrit.wikimedia.org/r/238089
[03:05:58] <icinga-wm>	 PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-others/snapshot is not accessible: Permission denied
[03:09:55] <grrrit-wm>	 (CR) Tim Landscheidt: "Tested the pinning on a Precise instance in Toolsbeta (after setting up toolsbeta-webproxy-01 as an aptly server):" [puppet] - https://gerrit.wikimedia.org/r/238089 (owner: Tim Landscheidt)
[03:28:00] <icinga-wm>	 RECOVERY - Disk space on labstore1002 is OK: DISK OK
[04:04:38] <icinga-wm>	 PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-maps/snapshot is not accessible: Permission denied
[04:09:28] <icinga-wm>	 PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=737.91 Read Requests/Sec=4338.45 Write Requests/Sec=593.37 KBytes Read/Sec=23464.04 KBytes_Written/Sec=2373.47
[04:11:20] <icinga-wm>	 RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.40 Read Requests/Sec=0.00 Write Requests/Sec=1.60 KBytes Read/Sec=0.00 KBytes_Written/Sec=6.40
[04:30:49] <icinga-wm>	 RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (-6457 100000s)
[04:34:27] <Krenair>	 the wikitech-static check seems to fail and recover regularly. does the threshold just need to be put up a bit?
[04:47:58] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Sep 14 04:47:58 UTC 2015 (duration 47m 57s)
[04:48:07] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:02:52] <grrrit-wm>	 (PS3) Ori.livneh: mediawiki: kill HHVM graphite checks [puppet] - https://gerrit.wikimedia.org/r/237998 (owner: Faidon Liambotis)
[05:03:06] <grrrit-wm>	 (PS4) Ori.livneh: mediawiki: kill HHVM graphite checks [puppet] - https://gerrit.wikimedia.org/r/237998 (owner: Faidon Liambotis)
[05:03:27] <grrrit-wm>	 (CR) Ori.livneh: [C: 2 V: 2] mediawiki: kill HHVM graphite checks [puppet] - https://gerrit.wikimedia.org/r/237998 (owner: Faidon Liambotis)
[05:09:35] <grrrit-wm>	 (PS1) Ori.livneh: Get rid of mw:monitoring:webserver [puppet] - https://gerrit.wikimedia.org/r/238091
[05:11:22] <ori>	 yay, the catalog compiler works again
[05:15:02] <grrrit-wm>	 (PS2) Ori.livneh: mediawiki:monitoring:webserver: ensure => absent [puppet] - https://gerrit.wikimedia.org/r/238091
[05:15:36] <grrrit-wm>	 (CR) Ori.livneh: [C: 2 V: 2] mediawiki:monitoring:webserver: ensure => absent [puppet] - https://gerrit.wikimedia.org/r/238091 (owner: Ori.livneh)
[05:20:59] <icinga-wm>	 PROBLEM - puppet last run on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:24:18] <icinga-wm>	 PROBLEM - Check size of conntrack table on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:24:39] <icinga-wm>	 PROBLEM - DPKG on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:25:00] <icinga-wm>	 PROBLEM - dhclient process on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:25:09] <icinga-wm>	 PROBLEM - Disk space on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:25:29] <icinga-wm>	 PROBLEM - RAID on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:25:30] <icinga-wm>	 PROBLEM - configured eth on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:25:31] <icinga-wm>	 PROBLEM - salt-minion processes on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:25:49] <icinga-wm>	 PROBLEM - spamassassin on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:01:17] <icinga-wm>	 PROBLEM - NTP on mendelevium is CRITICAL: NTP CRITICAL: No response from NTP server
[06:27:39] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[06:29:47] <icinga-wm>	 PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:48] <icinga-wm>	 PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:08] <icinga-wm>	 PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: puppet fail
[06:30:28] <icinga-wm>	 PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:28] <icinga-wm>	 PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:29] <icinga-wm>	 PROBLEM - puppet last run on db2060 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:38] <icinga-wm>	 PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:28] <icinga-wm>	 PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:48] <icinga-wm>	 PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:58] <icinga-wm>	 PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:39] <icinga-wm>	 PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:15] <wikibugs>	 operations, Wikimedia-Language-setup, Wikimedia-Site-Requests: Bhojpuri wikipedia should start with 'bho' instead of 'bh' to avoid confusion with Bihari - https://phabricator.wikimedia.org/T41968#1636255 (Liuxinyu970226)
[06:37:09] <wikibugs>	 operations, Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1636259 (Menner)
[06:39:40] <wikibugs>	 operations, Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1636272 (Menner)
[06:39:41] <wikibugs>	 operations, Wikimedia-SVG-rendering, Upstream: Filter effect Gaussian blur filter not rendered correctly for small to medium thumbnail sizes - https://phabricator.wikimedia.org/T44090#1636271 (Menner)
[06:42:48] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[06:46:10] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[06:49:17] <icinga-wm>	 PROBLEM - OTRS SMTP on mendelevium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:56:18] <icinga-wm>	 RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:56:19] <icinga-wm>	 PROBLEM - RAID on db1043 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[06:56:28] <icinga-wm>	 RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:38] <icinga-wm>	 RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:56:39] <icinga-wm>	 RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:56:49] <icinga-wm>	 RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:56:58] <icinga-wm>	 RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:59] <icinga-wm>	 RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:07] <icinga-wm>	 RECOVERY - puppet last run on db2060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:17] <icinga-wm>	 RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:37] <icinga-wm>	 RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[06:57:58] <icinga-wm>	 RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:02:50] <grrrit-wm>	 (CR) Filippo Giunchedi: [C: 1] "good chance to test T109711 too" [puppet] - https://gerrit.wikimedia.org/r/237389 (https://phabricator.wikimedia.org/T105307) (owner: Muehlenhoff)
[07:04:17] <icinga-wm>	 PROBLEM - SSH on mendelevium is CRITICAL: Server answer
[07:05:57] <icinga-wm>	 RECOVERY - SSH on mendelevium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[07:06:05] <grrrit-wm>	 (CR) Filippo Giunchedi: [C: -1] "good to merge, minor error in units" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/225292 (owner: Nemo bis)
[07:09:20] <grrrit-wm>	 (CR) Filippo Giunchedi: [C: 1] Slightly increase RESTBase job runner concurrency [puppet] - https://gerrit.wikimedia.org/r/237868 (owner: GWicke)
[07:10:10] <_joe_>	 mendelevium?
[07:10:21] <_joe_>	 ach some catchup to do I'd say :)
[07:17:51] <godog>	 _joe_: yeah we're playing biology with hosts
[07:18:29] <grrrit-wm>	 (PS4) Nemo bis: Add some more redis monitoring metrics to ganglia [puppet] - https://gerrit.wikimedia.org/r/225292
[07:20:39] <icinga-wm>	 PROBLEM - SSH on mendelevium is CRITICAL: Server answer
[07:21:05] <grrrit-wm>	 (Abandoned) Giuseppe Lavagetto: labstore: fix replication checks [puppet] - https://gerrit.wikimedia.org/r/234490 (owner: Giuseppe Lavagetto)
[07:22:18] <icinga-wm>	 RECOVERY - SSH on mendelevium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[07:23:45] <grrrit-wm>	 (CR) Giuseppe Lavagetto: [C: 2] Backport of D45165: Limit log message length for unserialize failures [debs/hhvm] - https://gerrit.wikimedia.org/r/237862 (owner: BryanDavis)
[07:23:54] <grrrit-wm>	 (CR) Giuseppe Lavagetto: [V: 2] Backport of D45165: Limit log message length for unserialize failures [debs/hhvm] - https://gerrit.wikimedia.org/r/237862 (owner: BryanDavis)
[07:30:33] <grrrit-wm>	 (PS5) Filippo Giunchedi: Add some more redis monitoring metrics to ganglia [puppet] - https://gerrit.wikimedia.org/r/225292 (owner: Nemo bis)
[07:30:40] <grrrit-wm>	 (CR) Filippo Giunchedi: [C: 2 V: 2] Add some more redis monitoring metrics to ganglia [puppet] - https://gerrit.wikimedia.org/r/225292 (owner: Nemo bis)
[07:30:48] <icinga-wm>	 PROBLEM - SSH on mendelevium is CRITICAL: Server answer
[07:37:18] <icinga-wm>	 RECOVERY - SSH on mendelevium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[07:38:38] <grrrit-wm>	 (PS6) KartikMistry: CX: Enable suggestion for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/237327 (https://phabricator.wikimedia.org/T112498)
[07:42:45] <grrrit-wm>	 (PS1) KartikMistry: CX: Enable Suggestions in ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/238097 (https://phabricator.wikimedia.org/T111901)
[07:48:34] <grrrit-wm>	 (PS2) Muehlenhoff: Create ferm rules for Hadoop master and Hadoop standby (common rules) [puppet] - https://gerrit.wikimedia.org/r/237335
[07:48:49] <icinga-wm>	 PROBLEM - SSH on mendelevium is CRITICAL: Server answer
[07:52:23] <godog>	 !log reboot ms-be1010 to pick up disk ordering change
[07:52:28] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:56:47] <icinga-wm>	 RECOVERY - Disk space on ms-be1010 is OK: DISK OK
[08:02:28] <icinga-wm>	 RECOVERY - SSH on mendelevium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[08:06:12] <wikibugs>	 operations, MediaWiki-General-or-Unknown, Patch-For-Review: Create a poolcounter instance in deployment-prep - https://phabricator.wikimedia.org/T112501#1636375 (Joe) NEW a:Joe
[08:07:28] <icinga-wm>	 PROBLEM - SSH on mendelevium is CRITICAL: Server answer
[08:11:37] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[08:12:37] <icinga-wm>	 RECOVERY - SSH on mendelevium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[08:13:08] <grrrit-wm>	 (PS1) Giuseppe Lavagetto: poolcounter: Add configuration for labs [mediawiki-config] - https://gerrit.wikimedia.org/r/238099
[08:16:27] <wikibugs>	 operations, ops-eqiad: db1043 degraded RAID - https://phabricator.wikimedia.org/T112502#1636393 (jcrespo) NEW
[08:16:39] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds
[08:17:38] <icinga-wm>	 PROBLEM - SSH on mendelevium is CRITICAL: Server answer
[08:17:48] <icinga-wm>	 ACKNOWLEDGEMENT - RAID on db1043 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Jcrespo https://phabricator.wikimedia.org/T112502
[08:21:45] <jynus>	 I am running a CPU intensive task on db1043 (but on only 1 CPU), let me know if you see some slow down on phabricator (you shouldn't)
[08:22:24] <jynus>	 (uptime load is 1)
[08:23:14] <mafk>	 jynus you breaking things? :) (good morning)
[08:23:47] <jynus>	 mafk, only when I have the time
[08:23:54] <mafk>	 :D
[08:24:11] <godog>	 mobrovac: I'm about to start renaming cassandra test cluster btw T112257
[08:24:18] <icinga-wm>	 RECOVERY - SSH on mendelevium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[08:25:00] <icinga-wm>	 RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[08:29:18] <wikibugs>	 operations, ops-eqiad: db1043 degraded RAID - https://phabricator.wikimedia.org/T112502#1636409 (jcrespo)
[08:30:44] <jynus>	 !log ending profiling and executing pt-query-digest on db1043 [ETA:4h]
[08:30:49] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:31:09] <icinga-wm>	 PROBLEM - SSH on mendelevium is CRITICAL: Server answer
[08:34:28] <icinga-wm>	 RECOVERY - SSH on mendelevium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[08:39:37] <icinga-wm>	 PROBLEM - SSH on mendelevium is CRITICAL: Server answer
[08:41:17] <icinga-wm>	 RECOVERY - SSH on mendelevium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[08:43:23] <wikibugs>	 operations: staged dumps: use the "cutoff" option as little as possible - https://phabricator.wikimedia.org/T110305#1636436 (ArielGlenn) Open>Resolved this worked fine for the September run, closing.
[08:43:24] <wikibugs>	 operations, Tracking: staged dumps implementation - https://phabricator.wikimedia.org/T107757#1636438 (ArielGlenn)
[08:43:50] <wikibugs>	 operations, Tracking: staged dumps implementation - https://phabricator.wikimedia.org/T107757#1502963 (ArielGlenn)
[08:43:50] <wikibugs>	 operations: worker bash script terminates early when there are still more wikis to run - https://phabricator.wikimedia.org/T107759#1636446 (ArielGlenn) Open>Resolved September run looked ok, closing.
[08:44:27] <grrrit-wm>	 (CR) Hashar: "Will probably be in conflict with https://gerrit.wikimedia.org/r/#/c/220308/ which is currently cherry picked on the integration puppetmas" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/237876 (https://phabricator.wikimedia.org/T110865) (owner: Zfilipin)
[08:44:30] <godog>	 !log silence mendelevium for today, status unclear T111532
[08:44:33] <godog>	 akosiaris: ^
[08:44:34] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:44:51] <grrrit-wm>	 (PS1) Jcrespo: Depool es1002, es1005, es1008 for decommission [mediawiki-config] - https://gerrit.wikimedia.org/r/238102
[08:45:08] <wikibugs>	 operations: Make dumps run via cron on each snapshot host - https://phabricator.wikimedia.org/T107750#1636457 (ArielGlenn)
[08:45:09] <wikibugs>	 operations: need script that handles all bash worker scripts on a given snapshot, per stage, rerunning failures as appropriate, managing resources as appropriate - https://phabricator.wikimedia.org/T107760#1636455 (ArielGlenn) Open>Resolved Partial run completed fine, September full run is in last phase...
[08:45:49] <grrrit-wm>	 (CR) Jcrespo: [C: 1] Depool es1002, es1005, es1008 for decommission [mediawiki-config] - https://gerrit.wikimedia.org/r/238102 (owner: Jcrespo)
[08:46:21] <wikibugs>	 operations: Make dumps run via cron on each snapshot host - https://phabricator.wikimedia.org/T107750#1502816 (ArielGlenn)
[08:46:22] <wikibugs>	 operations, Tracking: staged dumps implementation - https://phabricator.wikimedia.org/T107757#1636458 (ArielGlenn) Open>Resolved done. closing.
[08:47:09] <grrrit-wm>	 (PS2) Filippo Giunchedi: cassandra: adjust test cluster name [puppet] - https://gerrit.wikimedia.org/r/237643 (https://phabricator.wikimedia.org/T112257)
[08:47:16] <grrrit-wm>	 (CR) Filippo Giunchedi: [C: 2 V: 2] cassandra: adjust test cluster name [puppet] - https://gerrit.wikimedia.org/r/237643 (https://phabricator.wikimedia.org/T112257) (owner: Filippo Giunchedi)
[08:52:37] <godog>	 !log rename cassandra test cluster and restart
[08:52:42] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:55:49] <icinga-wm>	 PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: /page/html/{title} is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /page/data-parsoid/{title} is CRITICAL: Test Get data-parsoid by title returned the unexpected status 500 (expecting: 200): /page/title/{title} is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200)
[08:58:46] <godog>	 looking ^
[09:00:18] <grrrit-wm>	 (PS8) Hashar: ci: Role for running Raita [puppet] - https://gerrit.wikimedia.org/r/208024 (owner: Dduvall)
[09:01:01] <grrrit-wm>	 (CR) Hashar: [C: 1 V: 2] "Rebased. That is applied on labs maybe we can get this change added to the next PuppetSWAT ?" [puppet] - https://gerrit.wikimedia.org/r/208024 (owner: Dduvall)
[09:01:09] <icinga-wm>	 RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy
[09:02:09] <grrrit-wm>	 (PS4) Hashar: contint: Install chromedriver for running MW-Selenium tests [puppet] - https://gerrit.wikimedia.org/r/223691 (https://phabricator.wikimedia.org/T103039) (owner: Dduvall)
[09:02:54] <grrrit-wm>	 (PS3) Hashar: contint: apt conf and python packages for light slaves [puppet] - https://gerrit.wikimedia.org/r/226715 (https://phabricator.wikimedia.org/T103972)
[09:03:14] <grrrit-wm>	 (CR) Hashar: [C: 1] contint: apt conf and python packages for light slaves [puppet] - https://gerrit.wikimedia.org/r/226715 (https://phabricator.wikimedia.org/T103972) (owner: Hashar)
[09:05:16] <grrrit-wm>	 (CR) Hashar: "_joe_ can you please merge this one in? It has been done for the etcd python jobs so we can ship on the Jessie slave the python utilities" [puppet] - https://gerrit.wikimedia.org/r/226715 (https://phabricator.wikimedia.org/T103972) (owner: Hashar)
[09:06:58] <grrrit-wm>	 (CR) Giuseppe Lavagetto: [C: -1] ci: Role for running Raita (3 comments) [puppet] - https://gerrit.wikimedia.org/r/208024 (owner: Dduvall)
[09:07:57] <wikibugs>	 operations, RESTBase, RESTBase-Cassandra, Patch-For-Review: Cassandra inter-node encryption (TLS) - https://phabricator.wikimedia.org/T108953#1636498 (fgiunchedi)
[09:08:00] <wikibugs>	 operations, RESTBase, RESTBase-Cassandra, Patch-For-Review: rename cassandra test cluster - https://phabricator.wikimedia.org/T112257#1636496 (fgiunchedi) Open>Resolved rename has been successful, it involved following the above procedure and rolling-restart the cluster. of course since this...
[09:08:46] <grrrit-wm>	 (CR) Giuseppe Lavagetto: [C: 2] contint: apt conf and python packages for light slaves [puppet] - https://gerrit.wikimedia.org/r/226715 (https://phabricator.wikimedia.org/T103972) (owner: Hashar)
[09:08:48] <grrrit-wm>	 (PS2) ArielGlenn: fixes for cert cleaner script for labs [puppet] - https://gerrit.wikimedia.org/r/237626
[09:08:53] <jynus>	 !log applying schema change to flowdb
[09:08:58] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:09:25] <grrrit-wm>	 (PS3) ArielGlenn: fixes for cert cleaner script for labs [puppet] - https://gerrit.wikimedia.org/r/237626
[09:10:03] <hashar>	 _joe_: thank you :)
[09:10:26] <grrrit-wm>	 (CR) ArielGlenn: [C: 2] fixes for cert cleaner script for labs [puppet] - https://gerrit.wikimedia.org/r/237626 (owner: ArielGlenn)
[09:12:07] <_joe_>	 uh someone merged the change already?
[09:16:01] <wikibugs>	 operations: sysctl::parameters don't take effect until next reboot (on Trusty at least) - https://phabricator.wikimedia.org/T109711#1636517 (MoritzMuehlenhoff) In my tests the class works per se, the confusion might stem from the fact that the net.netfilter.nf_conntrack_buckets value cannot be changed the sam...
[09:26:13] <grrrit-wm>	 (CR) Mobrovac: [C: 1] Slightly increase RESTBase job runner concurrency [puppet] - https://gerrit.wikimedia.org/r/237868 (owner: GWicke)
[09:35:43] <grrrit-wm>	 (PS1) DCausse: Upgrade to extra plugin 1.7.1 [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/238105 (https://phabricator.wikimedia.org/T112499)
[09:36:42] <grrrit-wm>	 (CR) DCausse: [C: -1] "Should not be merged now" [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/238105 (https://phabricator.wikimedia.org/T112499) (owner: DCausse)
[09:38:58] <grrrit-wm>	 (PS1) Filippo Giunchedi: redis: match ganglia monitoring configuration with latest changes [puppet] - https://gerrit.wikimedia.org/r/238106
[09:42:05] <grrrit-wm>	 (PS2) Filippo Giunchedi: redis: match ganglia monitoring configuration with latest changes [puppet] - https://gerrit.wikimedia.org/r/238106
[09:45:21] <grrrit-wm>	 (CR) Filippo Giunchedi: [C: 2 V: 2] redis: match ganglia monitoring configuration with latest changes [puppet] - https://gerrit.wikimedia.org/r/238106 (owner: Filippo Giunchedi)
[09:51:17] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds
[09:55:00] <grrrit-wm>	 (PS1) Giuseppe Lavagetto: poolcounter: add connect_timeout in codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/238108 (https://phabricator.wikimedia.org/T105378)
[09:55:02] <grrrit-wm>	 (PS1) Giuseppe Lavagetto: poolcounter: enable connect_timeout for testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/238109 (https://phabricator.wikimedia.org/T105378)
[09:55:56] <wikibugs>	 operations, MediaWiki-General-or-Unknown, Patch-For-Review: Create a poolcounter instance in deployment-prep - https://phabricator.wikimedia.org/T112501#1636584 (Joe) Cannot make a new instance communicate with the deployment-prep puppetmaster. @andrewbogott any help would be appreciated.
[09:57:05] <wikibugs>	 operations, MediaWiki-General-or-Unknown, Patch-For-Review: Create a poolcounter instance in deployment-prep - https://phabricator.wikimedia.org/T112501#1636585 (Joe)
[10:02:48] <grrrit-wm>	 (PS2) Muehlenhoff: Raise default conntrack table size [puppet] - https://gerrit.wikimedia.org/r/237389 (https://phabricator.wikimedia.org/T105307)
[10:04:09] <jynus>	 !log db1029 (x1-master) temporarily saturated by connections- flow was unresponsive for 10 minutes; migration partially aborted
[10:04:15] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:16:39] <wikibugs>	 operations, discovery-system: Remove etcd1001,2 from the etcd cluster, decommission them. - https://phabricator.wikimedia.org/T108010#1636600 (Joe) Open>Resolved
[10:19:18] <grrrit-wm>	 (PS3) Muehlenhoff: Raise default conntrack table size [puppet] - https://gerrit.wikimedia.org/r/237389 (https://phabricator.wikimedia.org/T105307)
[10:25:32] <wikibugs>	 operations, Traffic, Monitoring, Pybal: Implement pybal pool state monitoring and alerting via icinga - https://phabricator.wikimedia.org/T102394#1636617 (Joe) I don't think it would get to be much easier, no.  What we need is for pybal to write its state to disk or to expose it in some other way.
[10:25:50] <grrrit-wm>	 (PS2) Catrope: Set $wgFlowMigrateReferenceWiki false on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/234207 (https://phabricator.wikimedia.org/T107204) (owner: Mattflaschen)
[10:25:52] <grrrit-wm>	 (PS1) Catrope: Set $wgFlowMigrateReferenceWiki to false in production [mediawiki-config] - https://gerrit.wikimedia.org/r/238111 (https://phabricator.wikimedia.org/T107204)
[10:33:22] <grrrit-wm>	 (CR) Jcrespo: [C: 2] Depool es1002, es1005, es1008 for decommission [mediawiki-config] - https://gerrit.wikimedia.org/r/238102 (owner: Jcrespo)
[10:35:10] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool es1002, es1005, es1008 (duration: 00m 12s)
[10:35:15] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:41:35] <grrrit-wm>	 (PS4) Muehlenhoff: Raise default conntrack table size [puppet] - https://gerrit.wikimedia.org/r/237389 (https://phabricator.wikimedia.org/T105307)
[10:44:04] <grrrit-wm>	 (PS5) Muehlenhoff: Raise default conntrack table size (WIP) [puppet] - https://gerrit.wikimedia.org/r/237389 (https://phabricator.wikimedia.org/T105307)
[10:48:56] <grrrit-wm>	 (CR) Jean-Frédéric: [C: 1] Add *.ggpht.com to Wikimedia Commons upload whitelist [mediawiki-config] - https://gerrit.wikimedia.org/r/234980 (https://phabricator.wikimedia.org/T110869) (owner: Dereckson)
[10:52:58] <icinga-wm>	 PROBLEM - puppet last run on mw1075 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:57:37] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 22 data above and 8 below the confidence bounds
[11:01:07] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 21 data above and 8 below the confidence bounds
[11:03:02] <grrrit-wm>	 (CR) Steinsplitter: "open since two weeks. can we please go ahead and merge it?" [mediawiki-config] - https://gerrit.wikimedia.org/r/234980 (https://phabricator.wikimedia.org/T110869) (owner: Dereckson)
[11:17:14] <grrrit-wm>	 (CR) Jcrespo: Add scap scripts to all canary app servers [puppet] - https://gerrit.wikimedia.org/r/237707 (https://phabricator.wikimedia.org/T112174) (owner: Jcrespo)
[11:17:35] <wikibugs>	 operations, MediaWiki-General-or-Unknown, Patch-For-Review: Create a poolcounter instance in deployment-prep - https://phabricator.wikimedia.org/T112501#1636776 (Krenair) ```root@deployment-poolcounter01:/var/lib/puppet# ping deployment-puppetmaster PING deployment-puppetmaster.deployment-prep.eqiad.wm...
[11:19:23] <wikibugs>	 operations, Beta-Cluster, Labs, MediaWiki-General-or-Unknown, Patch-For-Review: Create a poolcounter instance in deployment-prep - https://phabricator.wikimedia.org/T112501#1636779 (Krenair)
[11:19:46] <wikibugs>	 operations, Beta-Cluster, Labs, MediaWiki-General-or-Unknown: Create a poolcounter instance in deployment-prep - https://phabricator.wikimedia.org/T112501#1636375 (Krenair)
[11:19:47] <icinga-wm>	 RECOVERY - puppet last run on mw1075 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:21:48] <wikibugs>	 operations, Traffic, fundraising-tech-ops, IPv6, Patch-For-Review: Enable IPv6 on donate.wikimedia.org - https://phabricator.wikimedia.org/T73267#1636783 (Pcoombe) @jrobell AFAIK this change won't affect banners. The main sites already use IPv6, this change is only for donatewiki.  Related: it s...
[11:23:47] <grrrit-wm>	 (CR) Matthias Mullie: [C: 1] "AFAICT, references are only fetched:" [mediawiki-config] - https://gerrit.wikimedia.org/r/238111 (https://phabricator.wikimedia.org/T107204) (owner: Catrope)
[11:26:53] <wikibugs>	 operations, Labs, Salt: salt does not run reliably for toollabs / labs generally - https://phabricator.wikimedia.org/T99213#1636790 (ArielGlenn) These changes are now live on labstore1001.   Check of instances that don't reply to test.ping now. **The following have 'no route to host' so presumably they...
[11:30:09] <wikibugs>	 operations, Labs, Salt: salt does not run reliably for toollabs / labs generally - https://phabricator.wikimedia.org/T99213#1636794 (ArielGlenn) response fast, all ten hosts, no timeout needed:  ``` root@labcontrol1001:~#  salt  -G 'fqdn:tools-webgrid-lighttpd-12*' cmd.run hostname tools-webgrid-lightt...
[11:30:56] <wikibugs>	 operations, Labs, Salt: salt does not run reliably for toollabs / labs generally - https://phabricator.wikimedia.org/T99213#1636802 (ArielGlenn) Open>Resolved a:ArielGlenn closing this, opening another ticket specific to the instances that owners must fix.
[11:33:41] <mobrovac>	 !log citoid deploying d569951
[11:33:46] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:36:38] <grrrit-wm>	 (PS6) Muehlenhoff: Raise default conntrack table size [puppet] - https://gerrit.wikimedia.org/r/237389 (https://phabricator.wikimedia.org/T105307)
[11:48:00] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 9 below the confidence bounds
[11:51:19] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 8 below the confidence bounds
[11:54:47] <wikibugs>	 operations, HHVM: /var/cache/hhvm/cli.hhbc.sq3 owned by root on some mw hosts - https://phabricator.wikimedia.org/T112517#1636888 (Krenair) NEW
[11:56:27] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 8 below the confidence bounds
[11:57:47] <wikibugs>	 operations, HHVM, Tracking: Complete the use of HHVM over Zend PHP on the Wikimedia cluster (tracking) - https://phabricator.wikimedia.org/T86081#1636896 (JeroenDeDauw) Is there any ETA on being able to use PHP 5.4 features in WMF deployed extensions yet?
[12:00:42] <wikibugs>	 operations, Datasets-General-or-Unknown: Sometimes (at peak usage?), dumps.wikimedia.org becomes very slow for users (sometimes unresponsive) - https://phabricator.wikimedia.org/T45647#1636908 (ArielGlenn) I'll see if I can correlate the times to server activity to get a lead on this.
[12:01:39] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 8 below the confidence bounds
[12:03:20] <wikibugs>	 operations, Beta-Cluster, HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1636917 (Krenair) Trusty replacement for tin = mira?
[12:06:40] <icinga-wm>	 PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 8 below the confidence bounds
[12:06:57] <wikibugs>	 operations, Traffic, Monitoring, Pybal: Implement pybal pool state monitoring and alerting via icinga - https://phabricator.wikimedia.org/T102394#1636919 (fgiunchedi) an easier option would be for pybal to expose its internal state via http for clients (e.g. icinga checks) to fetch, like e.g. hhvm d...
[12:07:28] <wikibugs>	 operations, HHVM, Tracking: Complete the use of HHVM over Zend PHP on the Wikimedia cluster (tracking) - https://phabricator.wikimedia.org/T86081#1636920 (Krenair) * {T104747} is blocked on {T110707} which is blocked by ops on https://gerrit.wikimedia.org/r/#/c/234699/ * {T94277} is waiting on @ArielGle...
[12:07:56] <wikibugs>	 operations, Traffic, Monitoring, Pybal: Implement pybal pool state monitoring and alerting via icinga - https://phabricator.wikimedia.org/T102394#1636929 (Joe) @fgiunchedi I am working on a patch in that direction right now :)
[12:08:12] <wikibugs>	 operations, Traffic, Monitoring, Pybal: Implement pybal pool state monitoring and alerting via icinga - https://phabricator.wikimedia.org/T102394#1636930 (Joe) a:Joe
[12:11:02] <grrrit-wm>	 (PS7) Muehlenhoff: Raise default conntrack table size [puppet] - https://gerrit.wikimedia.org/r/237389 (https://phabricator.wikimedia.org/T105307)
[12:13:30] <grrrit-wm>	 (CR) Muehlenhoff: "I've tested this on the mediawiki instances in deployment-prep and in my ferm test systems in labs; the correct values are set after a reb" [puppet] - https://gerrit.wikimedia.org/r/237389 (https://phabricator.wikimedia.org/T105307) (owner: Muehlenhoff)
[12:15:54] <Krenair>	 moritzm, I was wondering if you might know what's going on in https://phabricator.wikimedia.org/T112501#1636776 ?
[12:17:24] <wikibugs>	 operations, Traffic, fundraising-tech-ops, IPv6, Patch-For-Review: Enable IPv6 on donate.wikimedia.org - https://phabricator.wikimedia.org/T73267#1636967 (BBlack) We haven't had the time to devote to it yet, it's just a matter of scheduling and priorities.
[12:23:48] <grrrit-wm>	 (CR) BBlack: [C: 1] Raise default conntrack table size [puppet] - https://gerrit.wikimedia.org/r/237389 (https://phabricator.wikimedia.org/T105307) (owner: Muehlenhoff)
[12:24:16] <moritzm>	 Krenair: hmm, it's not caused by ferm rules on either deployment-puppetmaster nor deployment-poolcounter01 (they don't have any), maybe related to the openstack update? I'll have a look at the logs
[12:24:29] <Krenair>	 thanks
[12:24:42] <Krenair>	 I checked and other hosts were successfully connecting to that port
[12:26:09] <_joe_>	 moritzm, Krenair I'm on it
[12:26:27] <_joe_>	 it's clearly a higher-level problem in the cloud network
[12:27:24] <_joe_>	 my guess is andrewbogott might know something more about that
[12:27:42] <grrrit-wm>	 (PS2) Filippo Giunchedi: cassandra: enable DC internode encryption for test cluster [puppet] - https://gerrit.wikimedia.org/r/237648 (https://phabricator.wikimedia.org/T108953)
[12:27:50] <grrrit-wm>	 (CR) Filippo Giunchedi: [C: 2 V: 2] cassandra: enable DC internode encryption for test cluster [puppet] - https://gerrit.wikimedia.org/r/237648 (https://phabricator.wikimedia.org/T108953) (owner: Filippo Giunchedi)
[12:28:09] <grrrit-wm>	 (PS2) ArielGlenn: crap salt cleanup scripts primarily for labs use [software] - https://gerrit.wikimedia.org/r/236798
[12:30:20] <grrrit-wm>	 (PS1) Aude: Remove (broken) Wikidata-specific SkinCopyrightFooter hook handler [mediawiki-config] - https://gerrit.wikimedia.org/r/238125 (https://phabricator.wikimedia.org/T112520)
[12:32:09] <icinga-wm>	 RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[12:32:33] <godog>	 !log enable dc encryption on cassandra test cluster and rolling restart
[12:32:38] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:33:40] <grrrit-wm>	 (PS2) Hashar: contint: upgrade setuptools from pypi [puppet] - https://gerrit.wikimedia.org/r/234254 (https://phabricator.wikimedia.org/T110506)
[12:34:13] <grrrit-wm>	 (CR) Hashar: [C: 1 V: 2] "Had this patch on the integration puppetmaster. On creating a new node everything went fine and the jobs are properly running." [puppet] - https://gerrit.wikimedia.org/r/234254 (https://phabricator.wikimedia.org/T110506) (owner: Hashar)
[12:35:30] <wikibugs>	 operations, Traffic, HTTPS: Track/notify cert expiries better - https://phabricator.wikimedia.org/T112521#1637061 (BBlack) NEW
[12:39:17] <wikibugs>	 operations, Salt: fix monitor-salt-keys.py to not rotate salt aes keys on deletion - https://phabricator.wikimedia.org/T112522#1637080 (ArielGlenn) NEW a:ArielGlenn
[12:47:09] <wikibugs>	 operations, Citoid, Services, Traffic: Remove citoid from parsoidcache - https://phabricator.wikimedia.org/T110476#1637106 (BBlack) I don't see where that's noted there.  Are you saying there's a reason to keep a separate cxserver.wikimedia.org in the long term, even after making it available via RB?
[12:50:38] <urandom>	 !log starting Cassandra repair on restbase1003 (nodetool repair -pr)
[12:50:43] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:52:40] <grrrit-wm>	 (PS1) Filippo Giunchedi: cassandra: add auxiliary (non-seed) codfw test hosts [puppet] - https://gerrit.wikimedia.org/r/238135 (https://phabricator.wikimedia.org/T108613)
[12:54:45] <grrrit-wm>	 (CR) Alex Monk: [C: 2] Remove (broken) Wikidata-specific SkinCopyrightFooter hook handler [mediawiki-config] - https://gerrit.wikimedia.org/r/238125 (https://phabricator.wikimedia.org/T112520) (owner: Aude)
[12:54:52] <grrrit-wm>	 (Merged) jenkins-bot: Remove (broken) Wikidata-specific SkinCopyrightFooter hook handler [mediawiki-config] - https://gerrit.wikimedia.org/r/238125 (https://phabricator.wikimedia.org/T112520) (owner: Aude)
[12:55:40] <logmsgbot>	 !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/238125/ (duration: 00m 13s)
[12:55:45] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:56:14] <bblack>	 !log rebooting lvs2006 to test eth hw params stuff...
[12:56:20] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:01:39] <grrrit-wm>	 (CR) Filippo Giunchedi: [C: -1] "LGTM overall, not sure about the onlyif test" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/237389 (https://phabricator.wikimedia.org/T105307) (owner: Muehlenhoff)
[13:02:47] <aude>	 thanks Krenair  :)
[13:07:04] <wikibugs>	 operations, Citoid, Services, Traffic: Remove citoid from parsoidcache - https://phabricator.wikimedia.org/T110476#1637147 (Mvolz) It was mentioned on another task (and I can't currently find the thread!) but @Jdforrester-WMF mentioned that we've been actively encouraging developers to use the cito...
[13:08:54] <grrrit-wm>	 (CR) Filippo Giunchedi: [C: -1] "I initially thought we'd want to separate seeds from non-seeds based on DC separation, though clients will cycle through seeds anyways so " [puppet] - https://gerrit.wikimedia.org/r/238135 (https://phabricator.wikimedia.org/T108613) (owner: Filippo Giunchedi)
[13:09:18] <wikibugs>	 operations, Citoid, Services, Traffic: Remove citoid from parsoidcache - https://phabricator.wikimedia.org/T110476#1637153 (mobrovac) >>! In T110476#1637106, @BBlack wrote: > I don't see where that's noted there.   Hm, indeed, it's not. Hm, strange. I remember discussing it with @Jdforrester-WMF on...
[13:13:07] <wikibugs>	 operations, Salt: various salt-minions are not replying to test.ping or commands - https://phabricator.wikimedia.org/T102808#1637169 (ArielGlenn) I will be looking into this again this week.
[13:15:11] <grrrit-wm>	 (PS1) Filippo Giunchedi: cassandra: add codfw test nodes [puppet] - https://gerrit.wikimedia.org/r/238138 (https://phabricator.wikimedia.org/T108613)
[13:16:05] <grrrit-wm>	 (Abandoned) Filippo Giunchedi: cassandra: add auxiliary (non-seed) codfw test hosts [puppet] - https://gerrit.wikimedia.org/r/238135 (https://phabricator.wikimedia.org/T108613) (owner: Filippo Giunchedi)
[13:22:40] <grrrit-wm>	 (CR) Eevans: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/238138 (https://phabricator.wikimedia.org/T108613) (owner: Filippo Giunchedi)
[13:23:13] <grrrit-wm>	 (CR) Filippo Giunchedi: [C: 2 V: 2] cassandra: add codfw test nodes [puppet] - https://gerrit.wikimedia.org/r/238138 (https://phabricator.wikimedia.org/T108613) (owner: Filippo Giunchedi)
[13:26:48] <wikibugs>	 operations, Traffic, Patch-For-Review: Re-investigate eth params on jessie LVS nodes - https://phabricator.wikimedia.org/T110530#1637197 (BBlack) GRO and LRO seem fine.  Still facing an issue with both the rxring parameters and the interface-rps parameters.  They can both be applied successfully post-b...
[13:31:09] <ShakespeareFan00>	 Hi
[13:31:31] <ShakespeareFan00>	 Is there anything that can be said about yesterday's "incident" yet?
[13:33:19] <wikibugs>	 operations, Traffic, fundraising-tech-ops, IPv6, Patch-For-Review: Enable IPv6 on donate.wikimedia.org - https://phabricator.wikimedia.org/T73267#1637211 (Jgreen) a:Jgreen>BBlack
[13:34:31] <wikibugs>	 operations, fundraising-tech-ops: reformulate kafkatee package to work with Trusty - https://phabricator.wikimedia.org/T110591#1637217 (Jgreen) p:Normal>High
[13:37:54] <wikibugs>	 operations, Traffic, fundraising-tech-ops, IPv6, Patch-For-Review: Enable IPv6 on donate.wikimedia.org - https://phabricator.wikimedia.org/T73267#1637231 (jrobell) Thank you @Pcoombe, as this doesn't seem to affect banner campaigns, the planned Luxembourg and Belgium campaign just went up at 1.3...
[13:38:29] <godog>	 !log stop puppet on restbase-test2001 and turn up cassandra
[13:38:34] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:38:39] <wikibugs>	 operations, fundraising-tech-ops: package udp-filter for Trusty, for use on fundraising banner_logger - https://phabricator.wikimedia.org/T110592#1637236 (Jgreen)
[13:39:22] <wikibugs>	 operations, fundraising-tech-ops: package udp-filter for Trusty, for use on fundraising banner_logger - https://phabricator.wikimedia.org/T110592#1581705 (Jgreen)
[13:39:23] <wikibugs>	 operations, fundraising-tech-ops: build libanon package for trusty - https://phabricator.wikimedia.org/T110739#1637238 (Jgreen) Open>Resolved builds now
[13:41:18] <icinga-wm>	 RECOVERY - Cassandra database on restbase-test2001 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon
[13:42:18] <icinga-wm>	 RECOVERY - Cassanda CQL query interface on restbase-test2001 is OK: TCP OK - 0.036 second response time on port 9042
[13:47:01] <wikibugs>	 operations, fundraising-tech-ops: package udp-filter for Trusty, for use on fundraising banner_logger - https://phabricator.wikimedia.org/T110592#1637249 (Jgreen)
[13:48:27] <wikibugs>	 operations, Traffic, Patch-For-Review: Re-investigate eth params on jessie LVS nodes - https://phabricator.wikimedia.org/T110530#1637252 (BBlack) Digging a little further in syslogs, apparently it is a race.  systemd ends up trying to configure eth[12] first and they fail the RSS IRQ pattern check, and...
[13:52:07] <wikibugs>	 operations, fundraising-tech-ops: package udp-filter for Trusty, for use on fundraising banner_logger - https://phabricator.wikimedia.org/T110592#1637263 (Jgreen) Open>Resolved compiles/builds fine now
[14:00:44] <grrrit-wm>	 (PS1) Filippo Giunchedi: cassandra: enable ssl_storage_port (7001) in ferm [puppet] - https://gerrit.wikimedia.org/r/238144 (https://phabricator.wikimedia.org/T108953)
[14:01:45] <grrrit-wm>	 (PS2) Filippo Giunchedi: cassandra: enable ssl_storage_port (7001) in ferm [puppet] - https://gerrit.wikimedia.org/r/238144 (https://phabricator.wikimedia.org/T108953)
[14:02:10] <grrrit-wm>	 (CR) Filippo Giunchedi: [C: 2 V: 2] cassandra: enable ssl_storage_port (7001) in ferm [puppet] - https://gerrit.wikimedia.org/r/238144 (https://phabricator.wikimedia.org/T108953) (owner: Filippo Giunchedi)
[14:02:52] <grrrit-wm>	 (PS3) Ottomata: Create ferm rules for Hadoop NameNode and ResourceManager for master and standby [puppet] - https://gerrit.wikimedia.org/r/237335 (owner: Muehlenhoff)
[14:14:48] <icinga-wm>	 RECOVERY - Cassandra database on restbase-test2003 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon
[14:15:18] <icinga-wm>	 RECOVERY - Cassanda CQL query interface on restbase-test2003 is OK: TCP OK - 0.034 second response time on port 9042
[14:15:57] <wikibugs>	 operations: Undo phab.wikidata.org hacks after wmfusercontent.org cert is fixed - https://phabricator.wikimedia.org/T112381#1637334 (csteipp) If we ever do this again in the future, let's use a wikimedia.org domain (longer history of segmenting untrusted subdomains).  @Bblack, do you have an eta from GlobalS...
[14:16:08] <icinga-wm>	 RECOVERY - Cassandra database on restbase-test2002 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon
[14:16:19] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 15.38% of data above the critical threshold [500.0]
[14:17:37] <icinga-wm>	 RECOVERY - Cassanda CQL query interface on restbase-test2002 is OK: TCP OK - 0.034 second response time on port 9042
[14:17:48] <icinga-wm>	 PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0]
[14:20:10] <wikibugs>	 operations, RESTBase, RESTBase-Cassandra, Patch-For-Review: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1637354 (fgiunchedi) ok cassandra is up in codfw with encryption enabled and `auto_bootstrap: false` so codfw and eqiad are seeing each other (step #4)...
[14:26:48] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [5000000.0]
[14:27:27] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[14:30:58] <icinga-wm>	 PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:30:58] <icinga-wm>	 PROBLEM - Restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:31:28] <_joe_>	 godog: ^^
[14:32:08] <icinga-wm>	 PROBLEM - Restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:32:20] <godog>	 looking, thanks _joe_ 
[14:37:08] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[14:38:47] <icinga-wm>	 RECOVERY - Restbase endpoints health on cerium is OK: All endpoints are healthy
[14:39:19] <icinga-wm>	 RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy
[14:39:19] <icinga-wm>	 RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy
[14:41:10] <godog>	 there's some 5xx alerts showing up for restbase in the dashboard, looking at those as well
[14:41:17] <wikibugs>	 operations, ops-codfw, netops: cr1-eqdfw PEM 0 failure - https://phabricator.wikimedia.org/T110435#1637414 (Papaul) RMA create..  Hello Papaul   The RMA is already done, the order # is R395890, The local logistics department will receive the request and will proceed from here, I will like you to keep...
[14:43:57] <icinga-wm>	 PROBLEM - Restbase endpoints health on cerium is CRITICAL: /page/html/{title} is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /page/data-parsoid/{title} is CRITICAL: Test Get data-parsoid by title returned the unexpected status 500 (expecting: 200): /page/title/{title} is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /page/revisio
[14:44:37] <icinga-wm>	 PROBLEM - Restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:44:38] <icinga-wm>	 PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:46:09] <icinga-wm>	 RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy
[14:46:21] <grrrit-wm>	 (PS1) ArielGlenn: labs key monitor/delete script: don't rotate saltmaster aes key on key deletion [puppet] - https://gerrit.wikimedia.org/r/238151
[14:48:21] <grrrit-wm>	 (PS1) Giuseppe Lavagetto: Add instrumentation [debs/pybal] - https://gerrit.wikimedia.org/r/238152 (https://phabricator.wikimedia.org/T102394)
[14:48:57] <wikibugs>	 operations, ops-codfw: ms-be2006.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T112242#1637427 (Papaul) I will have replacement drive on site tomorrow.
[14:49:19] <wikibugs>	 operations, ops-eqiad, Traffic, netops: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1637428 (faidon)
[14:51:27] <icinga-wm>	 PROBLEM - Restbase endpoints health on xenon is CRITICAL: /page/html/{title} is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /page/data-parsoid/{title} is CRITICAL: Test Get data-parsoid by title returned the unexpected status 500 (expecting: 200): /page/title/{title} is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /page/revision
[14:52:26] <wikibugs>	 operations, Wikimedia-Mailing-lists, Patch-For-Review: rsync the diff since mail was held on sodium - https://phabricator.wikimedia.org/T110138#1637435 (Dzahn) one more time, about 20 hours later  sent 5435891 bytes  received 10238 bytes  37952.12 bytes/sec total size is 2837146704  speedup is 520.95...
[14:52:54] <grrrit-wm>	 (CR) ArielGlenn: [C: 2] labs key monitor/delete script: don't rotate saltmaster aes key on key deletion [puppet] - https://gerrit.wikimedia.org/r/238151 (owner: ArielGlenn)
[14:53:37] <wikibugs>	 operations, Availability, Monitoring: Monitor MediaWiki sessions - https://phabricator.wikimedia.org/T108985#1637437 (chasemp)
[14:53:37] <grrrit-wm>	 (PS1) Andrew Bogott: Upgrade labvirt1007 to kilo [puppet] - https://gerrit.wikimedia.org/r/238154
[14:54:27] <wikibugs>	 operations, Discovery, Wikimedia-Logstash, Elasticsearch, Graphite: Deploy statsd plugin for production elasticsearch & logstash - https://phabricator.wikimedia.org/T90889#1637438 (chasemp) We are going to try on https://phabricator.wikimedia.org/T111573 first I think
[14:55:09] <wikibugs>	 operations, Beta-Cluster, Labs, MediaWiki-General-or-Unknown: Create a poolcounter instance in deployment-prep - https://phabricator.wikimedia.org/T112501#1637443 (Andrew) I see this problem and can reproduce it on another instance.  No idea as to the cause yet.
[14:55:18] <grrrit-wm>	 (PS2) Andrew Bogott: Upgrade labvirt1007 to kilo [puppet] - https://gerrit.wikimedia.org/r/238154
[14:56:22] <grrrit-wm>	 (CR) Andrew Bogott: [C: 2] Upgrade labvirt1007 to kilo [puppet] - https://gerrit.wikimedia.org/r/238154 (owner: Andrew Bogott)
[14:57:11] <wikibugs>	 operations, Traffic, Patch-For-Review: Re-investigate eth params on jessie LVS nodes - https://phabricator.wikimedia.org/T110530#1637444 (BBlack) So, basically this is a race centered around bnx2x->udev->systemd event notifications and /e/n/i up-commands that set hardware parameters.  It's probably goi...
[14:57:33] <wikibugs>	 operations, Traffic, Patch-For-Review: Re-investigate eth params on jessie LVS nodes - https://phabricator.wikimedia.org/T110530#1637449 (BBlack)
[14:57:49] <wikibugs>	 operations, Traffic, Patch-For-Review: Re-investigate eth params on jessie LVS nodes - https://phabricator.wikimedia.org/T110530#1580041 (BBlack)
[14:57:50] <wikibugs>	 operations, ops-eqiad, Traffic, netops: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1637451 (BBlack)
[15:00:04] <jouncebot>	 anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150914T1500). Please do the needful.
[15:00:04] <jouncebot>	 Krenair bawolff: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[15:00:11] <Krenair>	 hey
[15:00:18] <bawolff>	 hi
[15:00:36] <grrrit-wm>	 (CR) Alex Monk: [C: 2] Add *.ggpht.com to Wikimedia Commons upload whitelist [mediawiki-config] - https://gerrit.wikimedia.org/r/234980 (https://phabricator.wikimedia.org/T110869) (owner: Dereckson)
[15:01:04] <grrrit-wm>	 (Merged) jenkins-bot: Add *.ggpht.com to Wikimedia Commons upload whitelist [mediawiki-config] - https://gerrit.wikimedia.org/r/234980 (https://phabricator.wikimedia.org/T110869) (owner: Dereckson)
[15:01:23] <bawolff>	 Woo!
[15:01:59] <logmsgbot>	 !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/234980/ (duration: 00m 12s)
[15:02:05] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:02:38] <icinga-wm>	 PROBLEM - DPKG on lvs1005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[15:02:48] <icinga-wm>	 PROBLEM - DPKG on lvs3003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[15:02:58] <icinga-wm>	 PROBLEM - DPKG on lvs3004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[15:02:58] <icinga-wm>	 PROBLEM - DPKG on lvs3002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[15:02:58] <icinga-wm>	 PROBLEM - DPKG on lvs1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[15:03:17] <icinga-wm>	 PROBLEM - DPKG on lvs1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[15:03:18] <grrrit-wm>	 (CR) Alex Monk: [C: 2] Revert "Add interwiki-labs.cdb" [mediawiki-config] - https://gerrit.wikimedia.org/r/237529 (owner: Alex Monk)
[15:03:39] <grrrit-wm>	 (Merged) jenkins-bot: Revert "Add interwiki-labs.cdb" [mediawiki-config] - https://gerrit.wikimedia.org/r/237529 (owner: Alex Monk)
[15:03:50] <wikibugs>	 operations, Salt: check usage of salt-key delete everywhere - https://phabricator.wikimedia.org/T112534#1637484 (ArielGlenn) NEW a:ArielGlenn
[15:03:59] <icinga-wm>	 PROBLEM - DPKG on lvs1006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[15:03:59] <icinga-wm>	 PROBLEM - DPKG on lvs1004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[15:04:08] <icinga-wm>	 PROBLEM - DPKG on lvs3001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[15:04:17] <icinga-wm>	 PROBLEM - DPKG on lvs1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[15:04:26] <bblack>	 ^ that's me
[15:04:32] <wikibugs>	 operations, Salt: fix monitor-salt-keys.py to not rotate salt aes keys on deletion - https://phabricator.wikimedia.org/T112522#1637495 (ArielGlenn) Open>Resolved https://gerrit.wikimedia.org/r/#/c/238151/ tested, merged and deployed. see related task T112534
[15:04:48] <logmsgbot>	 !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/237529/ (duration: 00m 11s)
[15:04:53] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:05:23] <logmsgbot>	 !log krenair@tin Synchronized docroot/noc: https://gerrit.wikimedia.org/r/#/c/237529/ (duration: 00m 12s)
[15:05:28] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:05:46] <wikibugs>	 operations, Beta-Cluster, Labs, MediaWiki-General-or-Unknown: Create a poolcounter instance in deployment-prep - https://phabricator.wikimedia.org/T112501#1637501 (Andrew) This appears to be yet another issue with the nova rolling-upgrade process.  The new instance, deployment-puppetmaster, was run...
[15:05:47] <logmsgbot>	 !log krenair@tin Synchronized .gitignore: https://gerrit.wikimedia.org/r/#/c/237529/ (duration: 00m 13s)
[15:05:51] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:06:17] <wikibugs>	 operations, Beta-Cluster, Labs, MediaWiki-General-or-Unknown: Create a poolcounter instance in deployment-prep - https://phabricator.wikimedia.org/T112501#1637502 (Andrew)
[15:07:27] <icinga-wm>	 RECOVERY - DPKG on lvs1006 is OK: All packages OK
[15:07:57] <icinga-wm>	 RECOVERY - DPKG on lvs3003 is OK: All packages OK
[15:08:23] <wikibugs>	 operations, Wikimedia-Mailing-lists, Patch-For-Review: rsync the diff since mail was held on sodium - https://phabricator.wikimedia.org/T110138#1637520 (JohnLewis) so we're seeing static values below 2 hours but not less than an hour and forty minutes? Still seems high but if we are directly rsyncing,...
[15:09:48] <icinga-wm>	 RECOVERY - DPKG on lvs3004 is OK: All packages OK
[15:09:48] <icinga-wm>	 RECOVERY - DPKG on lvs3002 is OK: All packages OK
[15:09:55] <wikibugs>	 operations, Beta-Cluster, Labs, MediaWiki-General-or-Unknown: Create a poolcounter instance in deployment-prep - https://phabricator.wikimedia.org/T112501#1637526 (Krenair) Signed and puppet successfully ran on deployment-poolcounter01.deployment-prep.eqiad.wmflabs
[15:11:39] <icinga-wm>	 RECOVERY - DPKG on lvs1002 is OK: All packages OK
[15:12:00] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [500.0]
[15:12:38] <icinga-wm>	 RECOVERY - DPKG on lvs3001 is OK: All packages OK
[15:13:22] <wikibugs>	 operations, Traffic, Patch-For-Review: Fix ethernet startup race on HP LVS w/ jessie - https://phabricator.wikimedia.org/T110530#1637531 (BBlack)
[15:13:46] <wikibugs>	 operations: Change distribution in releases.wikimedia.org to "sid" or "jessie" - https://phabricator.wikimedia.org/T111225#1637536 (GWicke) In the meantime, I pushed 0.4 to releases.wikimedia.org, but before we can switch to that {T111225} will need to be resolved.
[15:14:16] <wikibugs>	 operations: Change distribution in releases.wikimedia.org to "sid" or "jessie" - https://phabricator.wikimedia.org/T111225#1637537 (GWicke) p:Low>High
[15:14:49] <icinga-wm>	 RECOVERY - DPKG on lvs1001 is OK: All packages OK
[15:16:07] <icinga-wm>	 RECOVERY - DPKG on lvs1003 is OK: All packages OK
[15:16:17] <icinga-wm>	 RECOVERY - DPKG on lvs1005 is OK: All packages OK
[15:17:37] <icinga-wm>	 RECOVERY - DPKG on lvs1004 is OK: All packages OK
[15:17:45] <wikibugs>	 operations: Change distribution in releases.wikimedia.org to "sid" or "jessie" - https://phabricator.wikimedia.org/T111225#1637548 (GWicke) Upped the priority to 'high', as this is blocking the move to the official repository. Since the labs repository is still broken pending some file restoration, we current...
[15:18:24] <wikibugs>	 operations: Change distribution in releases.wikimedia.org to "sid" or "jessie" - https://phabricator.wikimedia.org/T111225#1637550 (GWicke)
[15:18:52] <grrrit-wm>	 (PS1) Andrew Bogott: Move default openstack version to Kilo [puppet] - https://gerrit.wikimedia.org/r/238158
[15:20:09] <RoanKattouw>	 Krenair: Can I sneak another change into the SWAT window? I can deploy it myself
[15:20:18] <RoanKattouw>	 (Cherry-pick of https://gerrit.wikimedia.org/r/238115 once it merges)
[15:20:22] <Krenair>	 At this point it's not exactly "sneaking", but sure :
[15:20:23] <Krenair>	 :)
[15:20:29] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[15:23:08] <icinga-wm>	 PROBLEM - DPKG on lvs3003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[15:23:14] <grrrit-wm>	 (CR) Nuria: Set replace=True for EventLogging MySQL consumer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/237688 (https://phabricator.wikimedia.org/T112265) (owner: Ottomata)
[15:23:18] <icinga-wm>	 PROBLEM - DPKG on lvs1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[15:23:18] <icinga-wm>	 PROBLEM - DPKG on lvs3004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[15:24:18] <icinga-wm>	 PROBLEM - puppet last run on lvs1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[15:24:18] <icinga-wm>	 PROBLEM - DPKG on lvs1004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[15:24:28] <icinga-wm>	 PROBLEM - DPKG on lvs3001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[15:24:35] <RoanKattouw>	 Yeah I guess it's a bit late
[15:24:58] <icinga-wm>	 RECOVERY - DPKG on lvs1001 is OK: All packages OK
[15:26:07] <icinga-wm>	 RECOVERY - DPKG on lvs1004 is OK: All packages OK
[15:26:38] <icinga-wm>	 RECOVERY - DPKG on lvs3003 is OK: All packages OK
[15:26:48] <icinga-wm>	 RECOVERY - DPKG on lvs3004 is OK: All packages OK
[15:26:54] <grrrit-wm>	 (PS2) Lokal Profil: Localisation updates from translatewiki.net [puppet] - https://gerrit.wikimedia.org/r/229136
[15:27:55] <grrrit-wm>	 (CR) Andrew Bogott: [C: 2] Move default openstack version to Kilo [puppet] - https://gerrit.wikimedia.org/r/238158 (owner: Andrew Bogott)
[15:29:40] <grrrit-wm>	 (PS1) BBlack: switch lvs[34]00x installer to jessie [puppet] - https://gerrit.wikimedia.org/r/238161 (https://phabricator.wikimedia.org/T96375)
[15:30:38] <icinga-wm>	 RECOVERY - Disk space on labstore1002 is OK: DISK OK
[15:32:04] <grrrit-wm>	 (CR) Lokal Profil: "The latest patch set is only an update to include translations done since the first patch set was sent for review." [puppet] - https://gerrit.wikimedia.org/r/229136 (owner: Lokal Profil)
[15:33:08] <grrrit-wm>	 (CR) BBlack: [C: 2] switch lvs[34]00x installer to jessie [puppet] - https://gerrit.wikimedia.org/r/238161 (https://phabricator.wikimedia.org/T96375) (owner: BBlack)
[15:34:45] <logmsgbot>	 !log catrope@tin Synchronized php-1.26wmf22/extensions/Echo/: SWAT (duration: 00m 13s)
[15:34:50] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:37:23] <wikibugs>	 6operations, 10Beta-Cluster, 6Labs, 10MediaWiki-General-or-Unknown: Create a poolcounter instance in deployment-prep - https://phabricator.wikimedia.org/T112501#1637622 (10Andrew)
[15:37:33] <wikibugs>	 6operations, 10MediaWiki-General-or-Unknown, 5Patch-For-Review: Stop a poolcounter server fail from being a SPOF for the service and the api (and the site) - https://phabricator.wikimedia.org/T105378#1637625 (10Andrew)
[15:37:34] <wikibugs>	 6operations, 10Beta-Cluster, 6Labs, 10MediaWiki-General-or-Unknown: Create a poolcounter instance in deployment-prep - https://phabricator.wikimedia.org/T112501#1637624 (10Andrew) 5Open>3Resolved
[15:37:39] <grrrit-wm>	 (03PS1) 10ArielGlenn: for wmf reimage script, don't rotate saltmaster aes key on minion key deletion [puppet] - 10https://gerrit.wikimedia.org/r/238164 
[15:38:04] <wikibugs>	 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1637628 (10jcrespo)
[15:39:03] <bblack>	 !log reinstalling lvs4003, lvs4003 (jessie upgrade: T96375)
[15:39:08] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:39:15] <jynus>	 I think we are storing every git commit on phabricator
[15:39:17] <bblack>	 !log reinstalling lvs4003, lvs4004 (jessie upgrade: T96375) (typo earlier)
[15:39:21] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:39:32] <grrrit-wm>	 (03PS2) 10Tim Landscheidt: Tools: Migrate from labsdebrepo to aptly [puppet] - 10https://gerrit.wikimedia.org/r/238089 
[15:40:41] <grrrit-wm>	 (03CR) 10Tim Landscheidt: "(PS2: Use apt::pin instead of a file resource.)" [puppet] - 10https://gerrit.wikimedia.org/r/238089 (owner: 10Tim Landscheidt)
[15:42:31] <wikibugs>	 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1637656 (10jcrespo) I've reopened T110913 to include the profiling I did on phabricator during the weekend. Some scary things there (in terms of performance).
[15:43:36] <wikibugs>	 6operations: Unable to connect to deployment-eventlogging02.eqiad.wmflabs - https://phabricator.wikimedia.org/T112540#1637657 (10Mholloway) 3NEW
[15:43:57] <wikibugs>	 6operations: Unable to connect to deployment-eventlogging02.eqiad.wmflabs - https://phabricator.wikimedia.org/T112540#1637669 (10Mholloway)
[15:44:08] <icinga-wm>	 RECOVERY - Restbase endpoints health on cerium is OK: All endpoints are healthy
[15:44:57] <icinga-wm>	 RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy
[15:44:57] <icinga-wm>	 RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy
[15:46:23] <godog>	 !log switch to openjdk-8 and bounce cassandra on restbase-test200*
[15:46:30] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:46:45] <wikibugs>	 6operations, 10Salt: check usage of salt-key delete everywhere - https://phabricator.wikimedia.org/T112534#1637676 (10ArielGlenn) https://gerrit.wikimedia.org/r/#/c/238164/  key rotation not needed for host re-imaging, the 24 hour rotation is good enough
[15:50:52] <wikibugs>	 6operations, 10Salt: check usage of salt-key delete everywhere - https://phabricator.wikimedia.org/T112534#1637690 (10ArielGlenn) looking at https://gerrit.wikimedia.org/r/#/c/48983/ it seems that auth.sls would be deleting keys almost never. so we can leave that script alone.
[15:51:28] <icinga-wm>	 RECOVERY - puppet last run on lvs1001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[15:52:15] <grrrit-wm>	 (03PS8) 10Muehlenhoff: Raise default conntrack table size [puppet] - 10https://gerrit.wikimedia.org/r/237389 (https://phabricator.wikimedia.org/T105307) 
[15:53:07] <wikibugs>	 10Ops-Access-Requests, 6operations: Unable to connect to deployment-eventlogging02.eqiad.wmflabs - https://phabricator.wikimedia.org/T112540#1637712 (10Mholloway)
[15:56:03] <mutante>	 product duty never changes, does it
[15:56:18] <icinga-wm>	 PROBLEM - nova-scheduler process on labcontrol1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-scheduler
[15:56:26] <grrrit-wm>	 (03PS1) 10Andrew Bogott: Openstack: Return the nova scheduler pool to normal. [puppet] - 10https://gerrit.wikimedia.org/r/238169 
[15:59:37] <SPF|Cloud>	 mutante: perhaps it did a few years ago ;)
[16:00:15] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Openstack: Return the nova scheduler pool to normal. [puppet] - 10https://gerrit.wikimedia.org/r/238169 (owner: 10Andrew Bogott)
[16:02:22] <JohnFLewis>	 SPF|Cloud: impossible! it's only existed about a year (if not less)
[16:03:59] <wikibugs>	 7Puppet, 6Analytics-Backlog, 10Analytics-Wikimetrics: Cleanup Wikimetrics puppet module so it can run puppet continuously without own puppetmaster {dove} - https://phabricator.wikimedia.org/T101763#1637743 (10madhuvishy)
[16:05:58] <wikibugs>	 6operations: audit all ssh certificates expiry on ops tracking gcal - https://phabricator.wikimedia.org/T112542#1637752 (10RobH) 3NEW a:3RobH
[16:06:15] <ottomata>	 !log stopping hdfs journalnode on analytics1011 to copy journal edits to new journalnodes on analytics1035 and analytics1052
[16:06:21] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:06:57] <JohnFLewis>	 robh: did you mean SSL? :)
[16:07:12] <wikibugs>	 6operations: audit all ssl certificates expiry on ops tracking gcal - https://phabricator.wikimedia.org/T112542#1637763 (10RobH)
[16:07:13] <robh>	 yes
[16:07:17] <wikibugs>	 6operations: audit all SSL certificates expiry on ops tracking gcal - https://phabricator.wikimedia.org/T112542#1637765 (10Krenair)
[16:08:51] <wikibugs>	 6operations, 6Performance-Team: New URL scheme for service-generated thumbnails - https://phabricator.wikimedia.org/T111048#1637782 (10Gilles) 5Open>3Invalid After re-reading the IIIF spec, it seems way too large a standard to support. It requires supporting many formats and filters that thumbor doesn't. I...
[16:08:51] <Nemo_bis>	 well, we can change
[16:09:34] <grrrit-wm>	 (03PS1) 10Ottomata: Adding new journalnodes in prep for decomissioning analytics1011 and analytics1019 [puppet] - 10https://gerrit.wikimedia.org/r/238173 (https://phabricator.wikimedia.org/T112113) 
[16:10:12] <wikibugs>	 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1637794 (10mmodell) Doesn't look too scary to me, can you elaborate?
[16:10:47] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032] Adding new journalnodes in prep for decomissioning analytics1011 and analytics1019 [puppet] - 10https://gerrit.wikimedia.org/r/238173 (https://phabricator.wikimedia.org/T112113) (owner: 10Ottomata)
[16:12:32] <wikibugs>	 6operations: audit all SSL certificates expiry on ops tracking gcal - https://phabricator.wikimedia.org/T112542#1637820 (10RobH) this is changing scope to a checklist for all ssl certificate purchases and how to review and audit
[16:12:42] <logmsgbot>	 !log catrope@tin Synchronized php-1.26wmf22/extensions/Echo/: For real this time (duration: 00m 11s)
[16:12:47] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:13:13] <icinga-wm>	 PROBLEM - Restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:13:13] <icinga-wm>	 PROBLEM - Restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:14:23] <icinga-wm>	 PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:15:05] <icinga-wm>	 RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy
[16:15:05] <icinga-wm>	 RECOVERY - Restbase endpoints health on cerium is OK: All endpoints are healthy
[16:16:14] <icinga-wm>	 RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy
[16:22:14] <icinga-wm>	 PROBLEM - configured eth on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:22:53] <icinga-wm>	 PROBLEM - salt-minion processes on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:22:54] <icinga-wm>	 PROBLEM - SSH on mw1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:23:08] <grrrit-wm>	 (03PS1) 10Ottomata: Update net-topology.py for Hadoop so that the default rack has the same hierarchy as real nodes [puppet] - 10https://gerrit.wikimedia.org/r/238179 
[16:23:11] <grrrit-wm>	 (03CR) 10Mdann52: [C: 031] noindex user namespace on en.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237330 (https://phabricator.wikimedia.org/T104797) (owner: 10Mdann52)
[16:23:33] <icinga-wm>	 PROBLEM - puppet last run on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:23:34] <icinga-wm>	 PROBLEM - nutcracker port on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:23:44] <icinga-wm>	 PROBLEM - DPKG on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:23:45] <icinga-wm>	 PROBLEM - RAID on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:24:45] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032] Update net-topology.py for Hadoop so that the default rack has the same hierarchy as real nodes [puppet] - 10https://gerrit.wikimedia.org/r/238179 (owner: 10Ottomata)
[16:24:53] <icinga-wm>	 RECOVERY - salt-minion processes on mw1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[16:27:08] <JohnFLewis>	 Nemo_bis: why did you remove it?
[16:27:44] <wikibugs>	 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1637867 (10Eevans)
[16:27:46] <JohnFLewis>	 it's there for a reason :) and unless someone from product said it doesn't need to be there; it should be.
[16:28:13] <icinga-wm>	 PROBLEM - Hadoop NameNode Primary Is Active on analytics1001 is CRITICAL: Hadoop.NameNode.FSNamesystem.tag_HAState CRITICAL: standby
[16:28:14] <icinga-wm>	 PROBLEM - Disk space on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:28:46] <ottomata>	 ^^^ this is ok
[16:28:55] <ottomata>	 i'm restarting namenodes to get a change in
[16:29:07] <wikibugs>	 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1524920 (10Eevans)
[16:29:35] <icinga-wm>	 ACKNOWLEDGEMENT - Hadoop NameNode Primary Is Active on analytics1001 is CRITICAL: Hadoop.NameNode.FSNamesystem.tag_HAState CRITICAL: standby ottomata restarting namenodes
[16:31:03] <icinga-wm>	 PROBLEM - dhclient process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:32:14] <icinga-wm>	 PROBLEM - nutcracker process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:32:24] <wikibugs>	 6operations, 5Patch-For-Review, 7Pybal: jessie pybals get restarted every day by logrotate, resetting BGP sessions - https://phabricator.wikimedia.org/T112457#1637893 (10BBlack)
[16:32:25] <icinga-wm>	 PROBLEM - salt-minion processes on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:32:25] <wikibugs>	 6operations, 10Traffic, 5Patch-For-Review: Upgrade codfw,ulsfo,esams LVS to jessie - https://phabricator.wikimedia.org/T96375#1637892 (10BBlack)
[16:32:44] <wikibugs>	 6operations, 10Traffic, 5Patch-For-Review: Upgrade codfw,ulsfo,esams LVS to jessie - https://phabricator.wikimedia.org/T96375#1215460 (10BBlack)
[16:33:24] <icinga-wm>	 RECOVERY - dhclient process on mw1006 is OK: PROCS OK: 0 processes with command name dhclient
[16:33:28] <wikibugs>	 10Ops-Access-Requests, 6operations: Unable to connect to deployment-eventlogging02.eqiad.wmflabs - https://phabricator.wikimedia.org/T112540#1637899 (10Krenair) wmflabs -> removing #operations and #ops-access-requests
[16:33:43] <icinga-wm>	 RECOVERY - nutcracker port on mw1006 is OK: TCP OK - 0.000 second response time on port 11212
[16:33:43] <icinga-wm>	 RECOVERY - Hadoop NameNode Primary Is Active on analytics1001 is OK: Hadoop.NameNode.FSNamesystem.tag_HAState OKAY: active
[16:33:44] <icinga-wm>	 RECOVERY - configured eth on mw1006 is OK: OK - interfaces up
[16:37:40] <Nemo_bis>	 JohnFLewis: I think you missed some steps, but do as you prefer
[16:38:27] <JohnFLewis>	 Nemo_bis: in such a meta case; it's better to leave it and discuss than reverse. you never know - during the time it's not there someone may actually need a product duty guy :)
[16:39:54] <icinga-wm>	 PROBLEM - configured eth on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:39:55] <icinga-wm>	 PROBLEM - nutcracker port on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:40:42] <grrrit-wm>	 (03PS1) 10Ottomata: Add analytics1053 and 1057 to Hadoop net-topology.py [puppet] - 10https://gerrit.wikimedia.org/r/238181 
[16:41:38] <grrrit-wm>	 (03PS2) 10Ottomata: Add analytics1053 and 1057 to Hadoop net-topology.py [puppet] - 10https://gerrit.wikimedia.org/r/238181 
[16:41:44] <icinga-wm>	 PROBLEM - dhclient process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:43:43] <icinga-wm>	 RECOVERY - dhclient process on mw1006 is OK: PROCS OK: 0 processes with command name dhclient
[16:43:58] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032] Add analytics1053 and 1057 to Hadoop net-topology.py [puppet] - 10https://gerrit.wikimedia.org/r/238181 (owner: 10Ottomata)
[16:44:41] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Request to access apertium-apy service restart - https://phabricator.wikimedia.org/T111360#1637973 (10RobH) This was approved in the ops meeting just now, so implementation to follow.  (I'm merely noting the approval in the meeting on task.)
[16:47:33] <icinga-wm>	 RECOVERY - Disk space on mw1006 is OK: DISK OK
[16:47:55] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] nodepool: sudo rules for contint-admins (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/235742 (https://phabricator.wikimedia.org/T111374) (owner: 10Hashar)
[16:48:29] <nuria>	 jelouuu, does anyone know what version of logstash do we have deployed?
[16:49:54] <icinga-wm>	 PROBLEM - dhclient process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:50:13] <icinga-wm>	 PROBLEM - Hadoop NodeManager on analytics1016 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[16:50:23] <YuviPanda>	 nuria: _808db would know
[16:50:31] <YuviPanda>	 oh I just realized that's the reverse of his regular name
[16:51:09] <nuria>	 _808db: hello
[16:52:54] <ottomata>	 that's me too, aye yai yai yarn man
[16:53:24] <nuria>	 me no comprendou
[16:53:34] <icinga-wm>	 PROBLEM - Disk space on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:54:29] <grrrit-wm>	 (03PS1) 10Ottomata: Remove analytics1011, 1016, and 1019 as Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/238185 (https://phabricator.wikimedia.org/T112113) 
[16:56:33] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032] Remove analytics1011, 1016, and 1019 as Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/238185 (https://phabricator.wikimedia.org/T112113) (owner: 10Ottomata)
[16:56:36] <Krenair>	 nuria, I think _808db is away
[16:57:04] <Krenair>	 phabricator says for two weeks
[17:01:55] <_joe_>	 ottomata: btw, I'm back and available for interviews :)
[17:04:22] <ottomata>	 _joe_:  ok awesome
[17:04:28] <ottomata>	 did you summit all the mountains?
[17:04:49] <_joe_>	 ottomata: ahah yeah
[17:05:55] <_joe_>	 ottomata: I was here http://www.kastra.eu/pics/mystra18.jpg when I got your message :)
[17:06:33] <ottomata>	 OooO
[17:08:26] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 9.09% of data above the critical threshold [500.0]
[17:12:05] <wikibugs>	 6operations, 10ops-eqiad, 6Labs, 3Labs-Sprint-114, 3ToolLabs-Goals-Q4: Make certain ports and cables between the labstores and shelves are numbered/named and labeled, and make sure that the diagram(s) reflect that. - https://phabricator.wikimedia.org/T112549#1638094 (10coren) 3NEW a:3coren
[17:13:23] <grrrit-wm>	 (03PS1) 10Tim Landscheidt: shinken: Make shinkengen compatible with ldap3 0.9.4.2 [puppet] - 10https://gerrit.wikimedia.org/r/238190 (https://phabricator.wikimedia.org/T101824) 
[17:15:25] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:17:08] <grrrit-wm>	 (03PS10) 10Ori.livneh: Send image varnish frontend data from logs to statsd [puppet] - 10https://gerrit.wikimedia.org/r/234157 (https://phabricator.wikimedia.org/T105681) (owner: 10Gilles)
[17:18:21] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032] Send image varnish frontend data from logs to statsd [puppet] - 10https://gerrit.wikimedia.org/r/234157 (https://phabricator.wikimedia.org/T105681) (owner: 10Gilles)
[17:19:41] <ori>	 argh, upload varnishes will complain about puppet failures in a moment
[17:19:41] <ori>	 sorry
[17:19:50] <wikibugs>	 6operations, 10fundraising-tech-ops: reformulate kafkatee package to work with Trusty - https://phabricator.wikimedia.org/T110591#1638171 (10Jgreen) >>! In T110591#1610291, @Ottomata wrote: > Done https://gerrit.wikimedia.org/r/#/c/236066/ >  > http://apt.wikimedia.org/wikimedia/pool/main/k/kafkatee/  Minor bu...
[17:21:27] <grrrit-wm>	 (03CR) 10Tim Landscheidt: "Tested this on shinken-test8-scfc against the backport and on shinken-01 against the current live one." [puppet] - 10https://gerrit.wikimedia.org/r/238190 (https://phabricator.wikimedia.org/T101824) (owner: 10Tim Landscheidt)
[17:21:30] <ori>	 oh, no, they won't
[17:21:36] <ori>	 the puppet failure i saw was some race condition
[17:27:17] <grrrit-wm>	 (03PS1) 10Chad: Minor tweaks to my .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/238191 
[17:28:17] <wikibugs>	 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1638205 (10jcrespo) Maybe inserting all the commits on the 16GB blob table is one of the reason of slowdowns (pure speculation).
[17:28:33] <ostriches>	 jouncebot: next
[17:28:33] <jouncebot>	 In 2 hour(s) and 31 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150914T2000)
[17:30:47] <grrrit-wm>	 (03PS1) 10BBlack: update star.wmfusercontent.org cert [puppet] - 10https://gerrit.wikimedia.org/r/238192 
[17:31:00] <grrrit-wm>	 (03CR) 10BBlack: [C: 032 V: 032] update star.wmfusercontent.org cert [puppet] - 10https://gerrit.wikimedia.org/r/238192 (owner: 10BBlack)
[17:35:06] <atgo>	 heyo - does anyone know why etherpad seems to be down?
[17:35:10] <ebernhardson>	 etherpad just fell over?
[17:35:19] <ebernhardson>	 ok, not just me :)
[17:35:24] <atgo>	 seems like it
[17:37:31] <grrrit-wm>	 (03PS2) 10Chad: Phab (labs): Move sshd to 2222, easier to remember than 222 [puppet] - 10https://gerrit.wikimedia.org/r/235777 
[17:37:49] <Krenair>	 at least it's not 29418 ostriches 
[17:38:00] <grrrit-wm>	 (03PS1) 10BBlack: switch wmfusercontent.org to RSA-only temporarily [puppet] - 10https://gerrit.wikimedia.org/r/238195 
[17:38:10] <ostriches>	 Krenair: That's for system sshd, most people won't use.
[17:38:11] <grrrit-wm>	 (03CR) 10BBlack: [C: 032 V: 032] switch wmfusercontent.org to RSA-only temporarily [puppet] - 10https://gerrit.wikimedia.org/r/238195 (owner: 10BBlack)
[17:38:16] <Krenair>	 ah
[17:38:18] <ostriches>	 Git's SSH will be more locked down, and on :22
[17:38:26] <icinga-wm>	 RECOVERY - dhclient process on mw1006 is OK: PROCS OK: 0 processes with command name dhclient
[17:38:35] <icinga-wm>	 RECOVERY - SSH on mw1006 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[17:38:45] <icinga-wm>	 RECOVERY - nutcracker process on mw1006 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[17:38:45] <icinga-wm>	 RECOVERY - DPKG on mw1006 is OK: All packages OK
[17:40:13] <Krenair>	 atgo, ebernhardson: it looks up
[17:40:16] <Krenair>	 but very slow
[17:40:37] <icinga-wm>	 PROBLEM - etherpad.wikimedia.org HTTP on etherpad1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:41:43] <Krenair>	 unusably slow
[17:41:46] <atgo>	 definitely totally down for me, krenair
[17:41:58] <Krenair>	 and sometimes it looks completely down
[17:42:02] <Krenair>	 I did manage to load a pad just now though
[17:43:46] <icinga-wm>	 PROBLEM - SSH on mw1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:43:57] <icinga-wm>	 PROBLEM - nutcracker process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:43:57] <icinga-wm>	 PROBLEM - DPKG on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:44:02] <grrrit-wm>	 (03CR) 10Chad: "Looks mostly good, minor nit inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/237096 (https://phabricator.wikimedia.org/T128) (owner: 1020after4)
[17:44:37] <Krenair>	 Yeah, it's just broken
[17:44:44] <Krenair>	 I think something similar to this happened recently
[17:44:52] <Krenair>	 I forget who was able to fix it
[17:45:07] <grrrit-wm>	 (03PS1) 10BBlack: Revert "switch phab altdom to phab.wikidata.org T112381" [puppet] - 10https://gerrit.wikimedia.org/r/238196 
[17:45:09] <grrrit-wm>	 (03PS1) 10BBlack: Revert "Temporarily move phab altdom into wikivoyage.org" [puppet] - 10https://gerrit.wikimedia.org/r/238197 
[17:45:17] <icinga-wm>	 PROBLEM - dhclient process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:45:21] <grrrit-wm>	 (03CR) 10BBlack: [C: 032 V: 032] Revert "switch phab altdom to phab.wikidata.org T112381" [puppet] - 10https://gerrit.wikimedia.org/r/238196 (owner: 10BBlack)
[17:45:32] <grrrit-wm>	 (03CR) 10BBlack: [C: 032 V: 032] Revert "Temporarily move phab altdom into wikivoyage.org" [puppet] - 10https://gerrit.wikimedia.org/r/238197 (owner: 10BBlack)
[17:45:45] <icinga-wm>	 RECOVERY - etherpad.wikimedia.org HTTP on etherpad1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 522 bytes in 0.011 second response time
[17:46:03] <Krenair>	 only incident docs I found for it were https://wikitech.wikimedia.org/wiki/Incident_documentation/20140714-Etherpad so I guess nobody bothered to write any last time
[17:48:25] <grrrit-wm>	 (03PS1) 10BBlack: Revert "Temporarily create phab.wikivoyage.org" [dns] - 10https://gerrit.wikimedia.org/r/238198 
[17:48:32] <legoktm>	 ostriches: woohoo, yay for using 22 :D
[17:49:38] <grrrit-wm>	 (03PS2) 10BBlack: Revert "Temporarily create phab.wikivoyage.org" [dns] - 10https://gerrit.wikimedia.org/r/238198 
[17:49:43] <grrrit-wm>	 (03CR) 10BBlack: [C: 032 V: 032] Revert "Temporarily create phab.wikivoyage.org" [dns] - 10https://gerrit.wikimedia.org/r/238198 (owner: 10BBlack)
[17:49:56] <grrrit-wm>	 (03PS1) 10BBlack: Revert "add phab.wikidata.org temporarily T112381" [dns] - 10https://gerrit.wikimedia.org/r/238199 
[17:50:30] <grrrit-wm>	 (03PS2) 10BBlack: Revert "add phab.wikidata.org temporarily T112381" [dns] - 10https://gerrit.wikimedia.org/r/238199 
[17:50:41] <grrrit-wm>	 (03PS1) 10BBlack: Revert "misc-web: temporarily broaden user content domain match for phab" [puppet] - 10https://gerrit.wikimedia.org/r/238200 
[17:50:46] <grrrit-wm>	 (03PS2) 10BBlack: Revert "misc-web: temporarily broaden user content domain match for phab" [puppet] - 10https://gerrit.wikimedia.org/r/238200 
[17:53:31] <wikibugs>	 6operations, 5Patch-For-Review: Undo phab.wikidata.org hacks after wmfusercontent.org cert is fixed - https://phabricator.wikimedia.org/T112381#1638290 (10BBlack) The reason we avoided wikimedia.org is the same reason this domain exists at all: it's a known security problem if phab loads user-defined content f...
[17:54:51] <grrrit-wm>	 (03CR) 10BBlack: [C: 032] Revert "add phab.wikidata.org temporarily T112381" [dns] - 10https://gerrit.wikimedia.org/r/238199 (owner: 10BBlack)
[17:55:56] <grrrit-wm>	 (03CR) 10BBlack: [C: 032] Revert "misc-web: temporarily broaden user content domain match for phab" [puppet] - 10https://gerrit.wikimedia.org/r/238200 (owner: 10BBlack)
[17:56:37] <icinga-wm>	 RECOVERY - Disk space on mw1006 is OK: DISK OK
[17:57:05] <icinga-wm>	 RECOVERY - configured eth on mw1006 is OK: OK - interfaces up
[17:57:17] <icinga-wm>	 RECOVERY - dhclient process on mw1006 is OK: PROCS OK: 0 processes with command name dhclient
[17:57:26] <icinga-wm>	 RECOVERY - SSH on mw1006 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[17:57:36] <icinga-wm>	 RECOVERY - nutcracker process on mw1006 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[17:57:36] <icinga-wm>	 RECOVERY - DPKG on mw1006 is OK: All packages OK
[17:57:46] <icinga-wm>	 RECOVERY - nutcracker port on mw1006 is OK: TCP OK - 0.000 second response time on port 11212
[17:57:56] <icinga-wm>	 RECOVERY - puppet last run on mw1006 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures
[17:58:06] <icinga-wm>	 RECOVERY - salt-minion processes on mw1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[17:58:06] <icinga-wm>	 RECOVERY - RAID on mw1006 is OK: OK: no RAID installed
[18:03:43] <wikibugs>	 6operations, 5Patch-For-Review: Undo phab.wikidata.org hacks after wmfusercontent.org cert is fixed - https://phabricator.wikimedia.org/T112381#1638349 (10BBlack) Everything's reverted back to a normal state on the new cert now, with the exception of:  https://gerrit.wikimedia.org/r/#/c/238195/ (RSA-only)  whi...
[18:05:20] <wikibugs>	 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1638365 (10mmodell) jcsrespo: I see. I don't have a very good understanding of mysql blob performance. I would have assumed that it handles large blobs fairly wel...
[18:05:26] <urandom>	 !log rebuilding restbase-test2001.codfw (nodetool rebuild -- eqiad)
[18:05:32] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:06:05] <grrrit-wm>	 (03PS3) 10Tim Landscheidt: Tools: Migrate from labsdebrepo to aptly [puppet] - 10https://gerrit.wikimedia.org/r/238089 (https://phabricator.wikimedia.org/T111708) 
[18:06:07] <grrrit-wm>	 (03PS1) 10Yuvipanda: aptly: Pin per-project aptly repository [puppet] - 10https://gerrit.wikimedia.org/r/238201 
[18:06:49] <wikibugs>	 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1638376 (10jcrespo) > I would have assumed that it handles large blobs fairly well.  It generally does, I wonder if large data movement can cause stalls, because...
[18:07:23] <grrrit-wm>	 (03CR) 10Yuvipanda: "I76b53c1073cbd007107ad8f60512f201a6583d31 does the pinning at the source. I've also applied the aptly::client role to all instances via Hi" [puppet] - 10https://gerrit.wikimedia.org/r/238089 (https://phabricator.wikimedia.org/T111708) (owner: 10Tim Landscheidt)
[18:07:46] <grrrit-wm>	 (03PS1) 10BBlack: Add new ECDSA cert for wmfusercontent [puppet] - 10https://gerrit.wikimedia.org/r/238202 
[18:08:23] <grrrit-wm>	 (03CR) 10BBlack: [C: 032 V: 032] Add new ECDSA cert for wmfusercontent [puppet] - 10https://gerrit.wikimedia.org/r/238202 (owner: 10BBlack)
[18:09:04] <grrrit-wm>	 (03PS1) 10BBlack: Revert "switch wmfusercontent.org to RSA-only temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/238203 
[18:09:09] <grrrit-wm>	 (03PS2) 10BBlack: Revert "switch wmfusercontent.org to RSA-only temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/238203 
[18:09:16] <grrrit-wm>	 (03CR) 10BBlack: [C: 032 V: 032] Revert "switch wmfusercontent.org to RSA-only temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/238203 (owner: 10BBlack)
[18:10:15] <wikibugs>	 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1638392 (10mmodell) There is quite a lot of activity on repositories but I didn't think the volume had changed very much in the past several weeks.  There has bee...
[18:10:39] <grrrit-wm>	 (03PS2) 10Yuvipanda: labs_lvm: Only run extend-instance-vol when needed [puppet] - 10https://gerrit.wikimedia.org/r/235642 (https://phabricator.wikimedia.org/T109933) (owner: 10Tim Landscheidt)
[18:11:00] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032 V: 032] labs_lvm: Only run extend-instance-vol when needed [puppet] - 10https://gerrit.wikimedia.org/r/235642 (https://phabricator.wikimedia.org/T109933) (owner: 10Tim Landscheidt)
[18:11:57] <wikibugs>	 6operations, 5Patch-For-Review: Undo phab.wikidata.org hacks after wmfusercontent.org cert is fixed - https://phabricator.wikimedia.org/T112381#1638396 (10BBlack) 5Open>3Resolved a:3BBlack The ECDSA re-issue was much quicker than expected (must be automated now for simple cases), so the last bit is rever...
[18:12:06] <grrrit-wm>	 (03PS2) 10Yuvipanda: Tools: Accept mail for all submit hosts [puppet] - 10https://gerrit.wikimedia.org/r/237863 (https://phabricator.wikimedia.org/T63484) (owner: 10Tim Landscheidt)
[18:12:13] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032 V: 032] Tools: Accept mail for all submit hosts [puppet] - 10https://gerrit.wikimedia.org/r/237863 (https://phabricator.wikimedia.org/T63484) (owner: 10Tim Landscheidt)
[18:14:06] <wikibugs>	 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1638411 (10chasemp) Just for historical perspective, when we first implemented we knew that MySQL was a stopgap store (for how long we didn't know), and decided t...
[18:15:08] <SPF|Cloud>	 Phabricator still is unstyled for me, stupid cache I guess? :(
[18:15:25] <SPF|Cloud>	 Firefox says sec_error_ocsp_unknown_cert for https://phab.wmfusercontent.org/
[18:16:17] <bblack>	 oh OCSP heh
[18:16:25] <bblack>	 it would fix itself within an hour, but I'll push it around faster
[18:16:40] <bblack>	 (ocsp updater needs to re-run for adding back the ECDSA cert)
[18:17:05] <bblack>	 try again, should be fixed
[18:17:20] <SPF|Cloud>	 bblack: yep, thanks
[18:18:06] <icinga-wm>	 PROBLEM - puppet last run on mw2105 is CRITICAL: CRITICAL: puppet fail
[18:21:03] <wikibugs>	 6operations: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5. - https://phabricator.wikimedia.org/T112559#1638425 (10EBernhardson) 3NEW
[18:21:31] <ebernhardson>	 i want to request having two servers in eqiad moved to different racks, should i email ops, or assign ticket to someone in particular, or how would i go about that?
[18:22:52] <wikibugs>	 6operations, 10ops-eqiad: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5. - https://phabricator.wikimedia.org/T112559#1638437 (10JohnLewis)
[18:23:03] <wikibugs>	 6operations, 10ops-eqiad: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5. - https://phabricator.wikimedia.org/T112559#1638441 (10EBernhardson)
[18:23:26] <JohnFLewis>	 ebernhardson: I added the dc project and CC'd Chris to the task. he (or someone) should pick it up and ask him to look at it :)
[18:23:37] <JohnFLewis>	 cmjohnson: ^ that also works since he is here :)
[18:24:01] <ebernhardson>	 JohnFLewis: thanks
[18:25:37] <grrrit-wm>	 (03Abandoned) 10Thcipriani: Add config deployment [tools/scap] - 10https://gerrit.wikimedia.org/r/235385 (owner: 10Thcipriani)
[18:28:22] <wikibugs>	 6operations, 10ops-eqiad: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5. - https://phabricator.wikimedia.org/T112559#1638476 (10Cmjohnson) Any particular servers you would like move or just take 2 that makes the most sense?   I am thinking  elastic1031 => A3 elastic103...
[18:28:36] <cmjohnson>	 johnflewis ^
[18:29:36] <JohnFLewis>	 ebernhardson: ^ enjoy the discussion! :)
[18:30:16] <icinga-wm>	 RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0]
[18:30:32] <grrrit-wm>	 (03PS2) 1020after4: SSH repo hosting support for phabricator. [puppet] - 10https://gerrit.wikimedia.org/r/237096 (https://phabricator.wikimedia.org/T128) 
[18:31:25] <grrrit-wm>	 (03PS3) 1020after4: SSH repo hosting support for phabricator. [puppet] - 10https://gerrit.wikimedia.org/r/237096 (https://phabricator.wikimedia.org/T128) 
[18:31:53] <grrrit-wm>	 (03CR) 1020after4: "I've addressed chad's concern and rebased on current production branch." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/237096 (https://phabricator.wikimedia.org/T128) (owner: 1020after4)
[18:32:03] <wikibugs>	 6operations, 10ops-eqiad: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5. - https://phabricator.wikimedia.org/T112559#1638480 (10EBernhardson) In terms of exact servers, whichever makes the most sense. I would like to see servers moved into both A and C racks for availab...
[18:32:20] <wikibugs>	 6operations, 10ops-eqiad: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5. - https://phabricator.wikimedia.org/T112559#1638482 (10chasemp) >>! In T112559#1638476, @Cmjohnson wrote: > Any particular servers you would like move or just take 2 that makes the most sense?  >...
[18:33:32] <grrrit-wm>	 (03CR) 1020after4: [C: 031] Phab (labs): Move sshd to 2222, easier to remember than 222 [puppet] - 10https://gerrit.wikimedia.org/r/235777 (owner: 10Chad)
[18:33:45] <icinga-wm>	 PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api is CRITICAL: Test web site in alternative language returned the unexpected status 520 (expecting: 200)
[18:33:59] <_joe_>	 uh
[18:34:47] <icinga-wm>	 PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api is CRITICAL: Test web site in alternative language returned the unexpected status 520 (expecting: 200)
[18:34:59] <grrrit-wm>	 (03CR) 10Hashar: nodepool: sudo rules for contint-admins (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/235742 (https://phabricator.wikimedia.org/T111374) (owner: 10Hashar)
[18:35:21] <grrrit-wm>	 (03CR) 1020after4: [C: 031] Phab: clean up role, remove ::config and ::main abstraction [puppet] - 10https://gerrit.wikimedia.org/r/235778 (owner: 10Chad)
[18:35:26] <grrrit-wm>	 (03PS3) 10Rush: Phab (labs): Move sshd to 2222, easier to remember than 222 [puppet] - 10https://gerrit.wikimedia.org/r/235777 (owner: 10Chad)
[18:36:37] <grrrit-wm>	 (03PS3) 10Hashar: nodepool: sudo rules for contint-admins [puppet] - 10https://gerrit.wikimedia.org/r/235742 (https://phabricator.wikimedia.org/T111374) 
[18:36:44] <grrrit-wm>	 (03PS2) 10Rush: Phab: clean up role, remove ::config and ::main abstraction [puppet] - 10https://gerrit.wikimedia.org/r/235778 (owner: 10Chad)
[18:39:19] <grrrit-wm>	 (03PS1) 10Nemo bis: [English Planet] Fetch all Magnus Manske posts [puppet] - 10https://gerrit.wikimedia.org/r/238207 
[18:42:00] <grrrit-wm>	 (03PS1) 10Thcipriani: Add pattern-matching arg to limit deploy hosts [tools/scap] - 10https://gerrit.wikimedia.org/r/238208 
[18:43:56] <grrrit-wm>	 (03PS3) 10Ori.livneh: Slightly increase RESTBase job runner concurrency [puppet] - 10https://gerrit.wikimedia.org/r/237868 (owner: 10GWicke)
[18:44:07] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032 V: 032] Slightly increase RESTBase job runner concurrency [puppet] - 10https://gerrit.wikimedia.org/r/237868 (owner: 10GWicke)
[18:44:42] <grrrit-wm>	 (03PS2) 10Ori.livneh: [English Planet] Fetch all Magnus Manske posts [puppet] - 10https://gerrit.wikimedia.org/r/238207 (owner: 10Nemo bis)
[18:45:03] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032 V: 032] "Magnus is Wiki." [puppet] - 10https://gerrit.wikimedia.org/r/238207 (owner: 10Nemo bis)
[18:45:07] <icinga-wm>	 RECOVERY - puppet last run on mw2105 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[18:46:35] <gwicke>	 ori, thanks re 37868!
[18:47:46] <wikibugs>	 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1638593 (10jcrespo) Well, we have the closest thing to amazon s3, which is Swift...  I do not think MySQL is a great place to store large files. Any relational da...
[18:47:47] <grrrit-wm>	 (03PS1) 10Thcipriani: Add --environment flag to cli.Application [tools/scap] - 10https://gerrit.wikimedia.org/r/238211 
[18:48:00] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 04-1] "Looks good -- but while you're here, could you make the names uniform, so it's not "adminpass" / "admin_pass", "adminuser" / "admin_user"," [puppet] - 10https://gerrit.wikimedia.org/r/235778 (owner: 10Chad)
[18:51:46] <icinga-wm>	 RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy
[18:52:26] <icinga-wm>	 RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy
[18:54:48] <grrrit-wm>	 (03CR) 10Kaldari: "Nemo_bis, TimStarling: What would be the best place to point people to from this code so that they can get an understanding of the history" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237960 (owner: 10Kaldari)
[18:57:38] <legoktm>	 jouncebot: next
[18:57:38] <jouncebot>	 In 1 hour(s) and 2 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150914T2000)
[19:00:28] <grrrit-wm>	 (03PS1) 10Thcipriani: Allow full path to hosts file [tools/scap] - 10https://gerrit.wikimedia.org/r/238213 
[19:04:13] <grrrit-wm>	 (03PS1) 10Andrew Bogott: toolschecker: rename a test to actually reflect what it does. [puppet] - 10https://gerrit.wikimedia.org/r/238215 
[19:04:14] <grrrit-wm>	 (03PS1) 10Andrew Bogott: Added an ldap test. [puppet] - 10https://gerrit.wikimedia.org/r/238216 (https://phabricator.wikimedia.org/T107454) 
[19:05:22] <grrrit-wm>	 (03PS2) 10Andrew Bogott: toolschecker: rename a test to actually reflect what it does. [puppet] - 10https://gerrit.wikimedia.org/r/238215 
[19:05:27] <grrrit-wm>	 (03PS2) 10Andrew Bogott: Added an ldap test. [puppet] - 10https://gerrit.wikimedia.org/r/238216 (https://phabricator.wikimedia.org/T107454) 
[19:06:47] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] toolschecker: rename a test to actually reflect what it does. [puppet] - 10https://gerrit.wikimedia.org/r/238215 (owner: 10Andrew Bogott)
[19:09:24] <grrrit-wm>	 (03CR) 10Zfilipin: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/235695 (https://phabricator.wikimedia.org/T102020) (owner: 10Zfilipin)
[19:09:33] <grrrit-wm>	 (03PS3) 10Andrew Bogott: toolschecker: Added an ldap test. [puppet] - 10https://gerrit.wikimedia.org/r/238216 (https://phabricator.wikimedia.org/T107454) 
[19:09:45] <grrrit-wm>	 (03CR) 10Zfilipin: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/235695 (https://phabricator.wikimedia.org/T102020) (owner: 10Zfilipin)
[19:10:31] <grrrit-wm>	 (03CR) 10Zfilipin: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/225238 (https://phabricator.wikimedia.org/T102020) (owner: 10Zfilipin)
[19:12:01] <grrrit-wm>	 (03CR) 10Gilles: "https://grafana.wikimedia.org/#/dashboard/db/resourceloader" [puppet] - 10https://gerrit.wikimedia.org/r/234157 (https://phabricator.wikimedia.org/T105681) (owner: 10Gilles)
[19:13:07] <grrrit-wm>	 (03PS4) 10Andrew Bogott: toolschecker: Added an ldap test. [puppet] - 10https://gerrit.wikimedia.org/r/238216 (https://phabricator.wikimedia.org/T107454) 
[19:13:51] <wikibugs>	 6operations, 6Labs, 10Labs-Infrastructure: move labs role classes to role/labs/foo structure - https://phabricator.wikimedia.org/T112570#1638664 (10Dzahn) 3NEW
[19:19:07] <grrrit-wm>	 (03PS5) 10Andrew Bogott: toolschecker: Added an ldap test. [puppet] - 10https://gerrit.wikimedia.org/r/238216 (https://phabricator.wikimedia.org/T107454) 
[19:20:25] <grrrit-wm>	 (03PS1) 10Hashar: Turn puppet autosign back on beta/integration [puppet] - 10https://gerrit.wikimedia.org/r/238221 (https://phabricator.wikimedia.org/T112537) 
[19:21:18] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] toolschecker: Added an ldap test. [puppet] - 10https://gerrit.wikimedia.org/r/238216 (https://phabricator.wikimedia.org/T107454) (owner: 10Andrew Bogott)
[19:22:13] <grrrit-wm>	 (03CR) 10Nemo bis: "Usually our code comments only link the original request, from which one has to follow links. Maybe just add the mailing list link without" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237960 (owner: 10Kaldari)
[19:23:49] <wikibugs>	 7Puppet, 6Labs: Move all labs-only puppet roles to manifests/role/labs - https://phabricator.wikimedia.org/T107167#1638699 (10Dzahn)
[19:23:50] <wikibugs>	 6operations, 6Labs, 10Labs-Infrastructure: move labs role classes to role/labs/foo structure - https://phabricator.wikimedia.org/T112570#1638698 (10Dzahn)
[19:25:40] <logmsgbot>	 !log legoktm@tin Synchronized php-1.26wmf22/extensions/Echo/: Only load nojs Special:Notifications styles on the special page (duration: 00m 12s)
[19:25:46] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:26:09] <grrrit-wm>	 (03PS5) 10Dzahn: phragile: Add role class [puppet] - 10https://gerrit.wikimedia.org/r/227466 (https://phabricator.wikimedia.org/T108803) (owner: 10WMDE-leszek)
[19:26:27] <grrrit-wm>	 (03CR) 10Rush: [C: 031] "seems good, some of the admin_pass weird naming is meant to reflect the weird naming in ops/private I think. Not that it necessarily mean" [puppet] - 10https://gerrit.wikimedia.org/r/235778 (owner: 10Chad)
[19:26:51] <ori>	 greg-g, thcipriani, twentyafterfour, ostriches: FYI, I've asked the collab team to push out fixes for T112401 ASAP, so they'll probably do sync-dir/files throughout the day.
[19:27:31] <twentyafterfour>	 I don't have any deployments today
[19:27:37] <grrrit-wm>	 (03PS6) 10Dzahn: phragile: Add role class [puppet] - 10https://gerrit.wikimedia.org/r/227466 (https://phabricator.wikimedia.org/T108803) (owner: 10WMDE-leszek)
[19:28:48] <wikibugs>	 6operations, 7Database: Drop phlegal_* databases from m3 - https://phabricator.wikimedia.org/T112573#1638724 (10chasemp) 3NEW a:3jcrespo
[19:29:49] <grrrit-wm>	 (03PS1) 10Andrew Bogott: toolschecker: s/labss/labs [puppet] - 10https://gerrit.wikimedia.org/r/238223 
[19:30:37] <wikibugs>	 6operations, 7Database: Drop phlegal_* databases from m3 - https://phabricator.wikimedia.org/T112573#1638754 (10chasemp) see: https://gerrit.wikimedia.org/r/#/c/235778/1  for related cleanup
[19:30:51] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] toolschecker: s/labss/labs [puppet] - 10https://gerrit.wikimedia.org/r/238223 (owner: 10Andrew Bogott)
[19:31:27] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] "amended to be called "labsphragile" rather than "phragile". i also don't think it looks great but that is the current standard today if yo" [puppet] - 10https://gerrit.wikimedia.org/r/227466 (https://phabricator.wikimedia.org/T108803) (owner: 10WMDE-leszek)
[19:32:04] <grrrit-wm>	 (03PS2) 10Andrew Bogott: toolschecker: s/labss/labs [puppet] - 10https://gerrit.wikimedia.org/r/238223 
[19:35:45] <icinga-wm>	 PROBLEM - check_puppetrun on beryllium is CRITICAL: CRITICAL: Puppet has 1 failures
[19:36:54] <grrrit-wm>	 (03CR) 1020after4: [C: 031] Add pattern-matching arg to limit deploy hosts [tools/scap] - 10https://gerrit.wikimedia.org/r/238208 (owner: 10Thcipriani)
[19:40:23] <grrrit-wm>	 (03CR) 10Ori.livneh: Basic role for Sentry (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles)
[19:40:45] <icinga-wm>	 PROBLEM - check_puppetrun on beryllium is CRITICAL: CRITICAL: Puppet has 1 failures
[19:41:39] <grrrit-wm>	 (03CR) 1020after4: [C: 032] Allow full path to hosts file [tools/scap] - 10https://gerrit.wikimedia.org/r/238213 (owner: 10Thcipriani)
[19:42:05] <grrrit-wm>	 (03PS4) 10Rush: Phab (labs): Move sshd to 2222, easier to remember than 222 [puppet] - 10https://gerrit.wikimedia.org/r/235777 (owner: 10Chad)
[19:42:16] <grrrit-wm>	 (03CR) 10Rush: [C: 032 V: 032] "no objection here" [puppet] - 10https://gerrit.wikimedia.org/r/235777 (owner: 10Chad)
[19:43:49] <bblack>	 gilles: so, revert https://gerrit.wikimedia.org/r/#/c/234157/ to unbreak?
[19:43:53] <bblack>	 I can push that through right now
[19:44:19] <bblack>	 unless you have some other plan or simple fix
[19:45:07] <wikibugs>	 6operations, 10ops-eqiad: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5. - https://phabricator.wikimedia.org/T112559#1638827 (10chasemp) For the moment none of these three can be missing at the same time:  hieradata/hosts/elastic1001.yaml:elasticsearch::master_eligible...
[19:45:17] <ori>	 bblack: let me take a quick look
[19:45:45] <icinga-wm>	 RECOVERY - check_puppetrun on beryllium is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[19:46:16] <bblack>	 ok
[19:48:51] <ori>	 bblack: i got it, fix incoming
[19:49:13] <wikibugs>	 6operations, 10hardware-requests: Request three servers for Pageview API - https://phabricator.wikimedia.org/T111053#1638833 (10Ottomata) The specs of those are all the same.  We'll use   - analytics1011 - analytics1016 - analytics1019  These will be reinstalled with Jessie and renamed.  The current node names...
[19:50:20] <grrrit-wm>	 (03PS3) 10Dzahn: admin: add kartik to apertium-admins [puppet] - 10https://gerrit.wikimedia.org/r/235854 (https://phabricator.wikimedia.org/T111360) 
[19:51:03] <wikibugs>	 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1638840 (10mmodell) So we could potentially have phabricator store its files in swift?
[19:52:48] <grrrit-wm>	 (03PS1) 10Ori.livneh: Follow-up for I8be1929b2: remove mediawiki::monitoring::webserver [puppet] - 10https://gerrit.wikimedia.org/r/238237 
[19:52:50] <grrrit-wm>	 (03PS1) 10Ori.livneh: Follow-up for Iae36f1: actually provision the varnishprocessor module [puppet] - 10https://gerrit.wikimedia.org/r/238238 
[19:53:03] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032 V: 032] Follow-up for I8be1929b2: remove mediawiki::monitoring::webserver [puppet] - 10https://gerrit.wikimedia.org/r/238237 (owner: 10Ori.livneh)
[19:54:09] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Request to access apertium-apy service restart - https://phabricator.wikimedia.org/T111360#1638856 (10Dzahn) a:3Dzahn
[19:55:14] <grrrit-wm>	 (03CR) 10Gilles: [C: 031] "Duh" [puppet] - 10https://gerrit.wikimedia.org/r/238238 (owner: 10Ori.livneh)
[19:56:01] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032] Follow-up for Iae36f1: actually provision the varnishprocessor module [puppet] - 10https://gerrit.wikimedia.org/r/238238 (owner: 10Ori.livneh)
[19:57:10] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Request to access apertium-apy service restart - https://phabricator.wikimedia.org/T111360#1638872 (10Dzahn) @KartikMistry I merged the change now since it was approved. It adds you to the group and the group to all nodes with the "sca" role.  members: [...
[19:57:37] <icinga-wm>	 PROBLEM - puppet last run on mw1026 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:57:55] <icinga-wm>	 PROBLEM - puppet last run on mw1231 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:58:06] <icinga-wm>	 PROBLEM - puppet last run on mw1139 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:58:07] <icinga-wm>	 PROBLEM - puppet last run on mw2020 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:58:16] <icinga-wm>	 PROBLEM - puppet last run on mw2091 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:58:16] <icinga-wm>	 PROBLEM - puppet last run on mw1219 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:58:16] <icinga-wm>	 PROBLEM - puppet last run on mw2069 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:58:31] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: contint-admins can't start/stop nodepool (lack sudo) - https://phabricator.wikimedia.org/T111374#1638874 (10Dzahn) a:3Dzahn
[19:58:45] <icinga-wm>	 PROBLEM - puppet last run on mw2157 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:58:46] <icinga-wm>	 PROBLEM - puppet last run on mw2033 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:59:07] <icinga-wm>	 PROBLEM - puppet last run on mw2141 is CRITICAL: CRITICAL: Puppet has 1 failures
[19:59:07] <icinga-wm>	 PROBLEM - puppet last run on mw2076 is CRITICAL: CRITICAL: Puppet has 1 failures
[20:00:02] <wikibugs>	 7Puppet, 6Labs: Move all labs-only puppet roles to manifests/role/labs - https://phabricator.wikimedia.org/T107167#1638882 (10scfc) In https://gerrit.wikimedia.org/r/#/c/230928/1/manifests/role/labsvagrantlxc.pp, @akosiaris wrote that roles should move to `modules/role/manifests/` in the long term, so ideally...
[20:00:04] <jouncebot>	 gwicke cscott arlolra subbu mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150914T2000). Please do the needful.
[20:00:29] <bblack>	 ori: the crits above are related to removing the hhvm monitoring thingy
[20:00:58] <bblack>	 (might be race conditions though)
[20:01:01] <ori>	 bblack: yes, but they are for hosts that were mid-run
[20:01:01] <ori>	 yeah
[20:01:29] <bblack>	 ok
[20:01:50] <ori>	 bblack: the varnish thing should be fixed as soon as that patch rolls out to all varnishes
[20:02:09] <ori>	 i forced a run on cp1048 and it worked correctly
[20:02:22] <wikibugs>	 6operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1638888 (10Mike_Peel) >>! In T104735#1632982, @Ricordisamoa wrote: >>>! In T104735#1632583, @BBlack wrote: >> This seems to satisfy pointless curiosity of users who look at a browser developer console and...
[20:03:02] <wikibugs>	 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1638894 (10chasemp) >>! In T109279#1638840, @mmodell wrote: > So we could potentially have phabricator store its files in swift?  1. Yes, but not sure how much w...
[20:03:16] <icinga-wm>	 RECOVERY - puppet last run on mw2020 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[20:03:57] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: contint-admins can't start/stop nodepool (lack sudo) - https://phabricator.wikimedia.org/T111374#1638900 (10Dzahn) In the ops meeting it has been said that this is approved in principle, but that we don't want to use sy...
[20:04:11] <subbu>	 starting parsoid deploy
[20:04:35] <wikibugs>	 10Ops-Access-Requests, 6operations: Requesting access to elasticsearch-roots - https://phabricator.wikimedia.org/T111473#1638901 (10Dzahn) a:3Dzahn
[20:05:49] <wikibugs>	 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1638902 (10mmodell) @chasemp: upstream phabricator is almost certainly storing things in s3.  1. Swift should be similar enough to s3 to make it an easy integrati...
[20:07:53] <wikibugs>	 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1638915 (10chasemp) I think there is some difference in API for swift vs s3, as in swift has a subset.  paging @fgiunchedi who should know offhand.
[20:10:49] <grrrit-wm>	 (03PS1) 10Dzahn: admins: add tfinc to elasticsearch-roots [puppet] - 10https://gerrit.wikimedia.org/r/238301 (https://phabricator.wikimedia.org/T111473) 
[20:11:38] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] admins: add tfinc to elasticsearch-roots [puppet] - 10https://gerrit.wikimedia.org/r/238301 (https://phabricator.wikimedia.org/T111473) (owner: 10Dzahn)
[20:12:08] <wikibugs>	 6operations, 10ops-eqiad: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5. - https://phabricator.wikimedia.org/T112559#1638926 (10chasemp) Also I think this will need to be updated at this time:  hieradata/regex.yaml  es_rack_a3:   __regex: !ruby/regexp /^elastic100[0-6]...
[20:14:22] <gilles>	 ori: thanks for the fix
[20:15:02] <wikibugs>	 6operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1638960 (10BBlack) The problem here is with people's perceptions mostly :/  It's a common pattern to use multiple domainnames to fetch sub-resources of a site.  Aside from the obvious examples like gstatic, e...
[20:15:14] <subbu>	 !log deployed parsoid sha 3d5f4359
[20:15:19] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:16:13] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to elasticsearch-roots - https://phabricator.wikimedia.org/T111473#1638967 (10Tfinc) @Dzahn, looks like this patch set adds "tfinc" instead of "tomasz". I'd mention that in CR but gerrit is not letting me in
[20:16:55] <wikibugs>	 6operations, 10ops-eqiad: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5. - https://phabricator.wikimedia.org/T112559#1638971 (10chasemp)
[20:18:06] <grrrit-wm>	 (03PS2) 10Dzahn: admins: add Tomasz Finc to elasticsearch-roots [puppet] - 10https://gerrit.wikimedia.org/r/238301 (https://phabricator.wikimedia.org/T111473) 
[20:18:44] <ori>	 gilles: is the varnishmedia data starting to come in?
[20:18:45] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to elasticsearch-roots - https://phabricator.wikimedia.org/T111473#1638977 (10Dzahn) @Tfinc ah, thanks, fixed!  Do we need to investigate the Gerrit issue?
[20:20:13] <grrrit-wm>	 (03PS3) 10Dzahn: admins: add Tomasz Finc to elasticsearch-roots [puppet] - 10https://gerrit.wikimedia.org/r/238301 (https://phabricator.wikimedia.org/T111473) 
[20:21:00] <andrewbogott>	 !log graceful’d apache2 on labcontrol1001
[20:21:05] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:22:08] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] admins: add Tomasz Finc to elasticsearch-roots [puppet] - 10https://gerrit.wikimedia.org/r/238301 (https://phabricator.wikimedia.org/T111473) (owner: 10Dzahn)
[20:22:24] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to elasticsearch-roots - https://phabricator.wikimedia.org/T111473#1638986 (10Krenair) Gerrit should be letting you in using your wikitech credentials. It's working for me...
[20:22:46] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to elasticsearch-roots - https://phabricator.wikimedia.org/T111473#1638996 (10Dzahn) Approved in ops meeting today.  Merged.
[20:23:14] <Gilles-phone>	 Ori: just walked out but yes I saw the data come in. One server's worth, cp1048 I presume
[20:23:16] <icinga-wm>	 RECOVERY - puppet last run on mw2091 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[20:23:40] <mutante>	 anyone know why puppet is disabled on elasticsearch?
[20:23:41] <Gilles-phone>	 Ori: I've added a graph for it to the media dashboard
[20:23:48] <mutante>	 administratively disabled (Reason: 'reason not specified');
[20:24:02] <mutante>	 can't add new admins without :)
[20:24:16] <icinga-wm>	 RECOVERY - puppet last run on mw2076 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[20:24:18] <mutante>	 please add reason.. or it's a bug that it's disabled
[20:24:35] <icinga-wm>	 RECOVERY - puppet last run on mw1231 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[20:24:45] <icinga-wm>	 RECOVERY - puppet last run on mw1139 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[20:24:56] <icinga-wm>	 RECOVERY - puppet last run on mw1219 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[20:25:04] <chasemp>	 mutante: if it's 1001 I added a reason as a second command, which must not have worked out
[20:25:06] <chasemp>	 my apologies
[20:25:25] <icinga-wm>	 RECOVERY - puppet last run on mw2157 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:25:26] <icinga-wm>	 RECOVERY - puppet last run on mw2033 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[20:25:28] <wikibugs>	 6operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1639009 (10Mike_Peel) >>! In T104735#1638960, @BBlack wrote: > The problem here is with people's perceptions mostly :/  It's a common pattern to use multiple domainnames to fetch sub-resources of a site.  Asi...
[20:25:40] <mutante>	 chasemp: ah, ok, yes it is 1001
[20:25:47] <icinga-wm>	 RECOVERY - puppet last run on mw2141 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:25:54] <mutante>	 chasemp: should i wait? i'm adding Tomasz, just did on 1002
[20:25:56] <chasemp>	 I ran it, then realized and ran again.  that seems not to actually work.
[20:25:56] <icinga-wm>	 RECOVERY - puppet last run on mw1026 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:26:01] <chasemp>	 nope I just fixed it up
[20:26:07] <mutante>	 ok, thanks
[20:26:22] <grrrit-wm>	 (03CR) 10Gergő Tisza: Basic role for Sentry (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles)
[20:26:45] <icinga-wm>	 RECOVERY - puppet last run on mw2069 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:28:00] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to elasticsearch-roots - https://phabricator.wikimedia.org/T111473#1639017 (10Dzahn) @Tfinc i saw puppet add your user on elastic1001 and 1002. The other 2 will just follow automatically. Since you already have a shell user and bastion...
[20:28:21] <wikibugs>	 6operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1639019 (10BBlack) >>! In T104735#1639009, @Mike_Peel wrote: >>>! In T104735#1638960, @BBlack wrote: >> The problem here is with people's perceptions mostly :/  It's a common pattern to use multiple domainnam...
[20:28:52] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to elasticsearch-roots - https://phabricator.wikimedia.org/T111473#1639021 (10Dzahn) 5Open>3Resolved @elastic1001:~# id tomasz uid=1155(tomasz) gid=500(wikidev) groups=500(wikidev),709(elasticsearch-roots)
[20:29:41] <mutante>	 kart_: you should have access to control apertium-apy now
[20:30:01] <mutante>	 tfinc: and you have the elastic access now
[20:30:14] <tfinc>	 mutante: thank you 
[20:30:40] <tfinc>	 mutante: did you see my note about "tfinc" vs "tomasz" in the phab task ?
[20:30:43] <mutante>	 yw, let us know if any issues with gerrit 
[20:30:46] <mutante>	 yes, i did
[20:30:55] <mutante>	 i changed it to "tomasz"
[20:31:18] <mutante>	 and confirmed you are in the elasticsearch-roots group
[20:32:15] <wikibugs>	 6operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1639045 (10Mike_Peel) >>! In T104735#1639019, @BBlack wrote: > Yeah but if you look closer, that example (en + bits) doesn't share anything but the trailing `.org`.  It's wiki**P**edia vs wiki**M**edia.  Simi...
[20:35:27] <icinga-wm>	 PROBLEM - puppet last run on mw2125 is CRITICAL: CRITICAL: Puppet has 1 failures
[20:36:10] <wikibugs>	 6operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1639062 (10BBlack) But now we're off in the territory of human comfort levels, not software.  It's still meaningless for any real verification to populate non-existent related hostnames just for people to loo...
[20:36:52] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Request to access apertium-apy service restart - https://phabricator.wikimedia.org/T111360#1639063 (10Dzahn) 5Open>3Resolved @KartikMistry please reopen if any issues, but i don't expect any because it's the same we do for other services in "sca".
[20:37:34] <wikibugs>	 10Ops-Access-Requests, 6operations: Requesting access to stat1003,  stat1002 and bast1001 for JMinor - https://phabricator.wikimedia.org/T111872#1639065 (10Dzahn) a:5jcrespo>3Dzahn
[20:38:16] <wikibugs>	 10Ops-Access-Requests, 6operations: Request to access apertium-apy service restart - https://phabricator.wikimedia.org/T111360#1639067 (10Dzahn)
[20:41:49] <wikibugs>	 6operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1639085 (10Mike_Peel) >>! In T104735#1639062, @BBlack wrote: > But now we're off in the territory of human comfort levels, not software.  It's still meaningless for any real verification to populate non-exist...
[20:45:41] <wikibugs>	 6operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1639106 (10BBlack) >>! In T104735#1639085, @Mike_Peel wrote: >>>! In T104735#1639062, @BBlack wrote: >> But now we're off in the territory of human comfort levels, not software.  It's still meaningless for an...
[20:47:07] <wikibugs>	 6operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1639116 (10Mike_Peel) >>! In T104735#1639106, @BBlack wrote: >>>! In T104735#1639085, @Mike_Peel wrote: >>>>! In T104735#1639062, @BBlack wrote: >>> But now we're off in the territory of human comfort levels,...
[20:49:48] <wikibugs>	 6operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1639132 (10BBlack) Why would anyone go there?  That hostname doesn't exist, and has never been linked anywhere.
[20:53:13] <cscott>	 !log updated OCG to version 5811056e28f2bc6408b6da96095352ab381bb11f
[20:53:19] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:54:26] <wikibugs>	 10Ops-Access-Requests, 6operations, 5Continuous-Integration-Scaling, 5Patch-For-Review: contint-admins can't start/stop nodepool (lack sudo) - https://phabricator.wikimedia.org/T111374#1639162 (10hashar) Thanks @Dzahn, I followed up on @akosiaris comment and adjusted the sudo rule to use service instead of s...
[20:56:10] <wikibugs>	 6operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1639166 (10Mike_Peel) We seem to have come full circle... I'm still not convinced that it's worth spending more time discussing this than it would take to fix the issue!
[20:59:20] <logmsgbot>	 !log legoktm@tin Synchronized php-1.26wmf22/extensions/Echo/: Hack around OOUI's icon pack being too large by creating our own (duration: 00m 12s)
[20:59:25] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:01:51] <grrrit-wm>	 (PS2) Rush: phab: use permissions for files on bot upload [puppet] - https://gerrit.wikimedia.org/r/236205
[21:01:58] <grrrit-wm>	 (CR) Rush: [C: 2 V: 2] phab: use permissions for files on bot upload [puppet] - https://gerrit.wikimedia.org/r/236205 (owner: Rush)
[21:02:16] <icinga-wm>	 RECOVERY - puppet last run on mw2125 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:05:00] <wikibugs>	 operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1639200 (BBlack) These are the kinds of things I've had to deal with over the past several months, most of which could've been avoided by and are a part of this larger philosophical problem, IMHO:  T101048...
[21:05:10] <grrrit-wm>	 (CR) Tim Landscheidt: [C: 1] aptly: Pin per-project aptly repository [puppet] - https://gerrit.wikimedia.org/r/238201 (owner: Yuvipanda)
[21:08:43] <grrrit-wm>	 (CR) Milimetric: "Quick update: we're finalizing the name of the three servers that have been allocated to this (see https://phabricator.wikimedia.org/T1110" [puppet] - https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: Milimetric)
[21:09:06] <grrrit-wm>	 (PS4) Dzahn: nodepool: sudo rules for contint-admins [puppet] - https://gerrit.wikimedia.org/r/235742 (https://phabricator.wikimedia.org/T111374) (owner: Hashar)
[21:10:13] <wikibugs>	 Ops-Access-Requests, operations, Continuous-Integration-Scaling, Patch-For-Review: contint-admins can't start/stop nodepool (lack sudo) - https://phabricator.wikimedia.org/T111374#1639211 (Dzahn) @hashar looks all good. thanks. i will merge. i also saw on labnodepool1001 "nodepool" is now recognized...
[21:10:33] <grrrit-wm>	 (CR) Dzahn: [C: 2] nodepool: sudo rules for contint-admins [puppet] - https://gerrit.wikimedia.org/r/235742 (https://phabricator.wikimedia.org/T111374) (owner: Hashar)
[21:12:13] <mutante>	 disabled (Reason: 'Andrew disabling puppet because nodepool is running amok.');
[21:12:36] <mutante>	 andrewbogott: ^ do we know if it can be enabled again? is that fresh or old?
[21:12:54] <mutante>	 i just want a single run to apply sudo changes
[21:13:21] <andrewbogott>	 mutante: I disabled because hashar asked me to...
[21:13:24] <andrewbogott>	 A single run is definitely fine
[21:13:37] <mutante>	 heh, ok, so this is for his own access :)
[21:13:41] <andrewbogott>	 If the nodepool service is running then my reason for disabling is moot
[21:13:49] <andrewbogott>	 so you can leave it enabled if nodepool is already up
[21:14:04] <mutante>	 it does not seem to be running
[21:14:06] <mutante>	 hashar: hi
[21:14:31] <wikibugs>	 operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1639223 (hashar) There is nothing requiring http on wmfusercontent.org and I am not sure what would be the use case.  Since the whole domain can host any arbitrary file (per design) and is solely used for i...
[21:14:37] <mutante>	 i will run it once and update on ticket
[21:14:42] <hashar>	 bblack: giving you some support :-}
[21:14:57] <hashar>	 mutante: andrewbogott you can reenable puppet / nodepool -}
[21:15:09] <andrewbogott>	 hashar: ok, thanks
[21:15:11] <hashar>	 I got it killed on thursday iirc , because I thought it could kill labs
[21:15:19] <mutante>	 ok, i'm doing it
[21:15:29] <hashar>	 but labs is fully operational and I tried nodepoold again today and it is all fine as far as I am concerned
[21:15:36] <grrrit-wm>	 (PS1) Andrew Bogott: toolschecker: Added db tests [puppet] - https://gerrit.wikimedia.org/r/238323 (https://phabricator.wikimedia.org/T107449)
[21:15:48] <mutante>	 !log labnodepool1001 - re-enable puppet and nodepool
[21:15:53] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:15:57] <hashar>	 it might create a snapshot image overnight and maybe delete / create an instance
[21:15:57] <logmsgbot>	 !log ori@tin Synchronized php-1.26wmf22/extensions/TitleBlacklist: Ie44fcb500: Avoid checking blacklists in isBlacklisted() for existing titles (duration: 00m 12s)
[21:16:03] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:16:08] <mutante>	 it's running now
[21:16:11] <hashar>	 \O/
[21:16:29] <hashar>	 andrewbogott: I got some doc work ongoing for nodepool :}
[21:16:53] <mutante>	 hashar: %contint-admins ALL = NOPASSWD: /usr/sbin/service nodepool start
[21:16:58] <mutante>	 the new sudo rules are there
[21:17:22] <hashar>	 \O/
[21:17:38] <mutante>	 eh, how many nodes are there..looks
[21:17:57] <grrrit-wm>	 (PS2) Andrew Bogott: toolschecker: Added db tests [puppet] - https://gerrit.wikimedia.org/r/238323 (https://phabricator.wikimedia.org/T107449)
[21:17:59] <hashar>	 only 1 - 5 :-D
[21:18:02] <mutante>	 just one. ok. then this is done :)
[21:18:11] <mutante>	 and you guys can also start and stop it now
[21:18:20] <hashar>	 confirmed
[21:18:22] <hashar>	 \O/
[21:18:25] <mutante>	 :)
[21:19:04] <wikibugs>	 Ops-Access-Requests, operations, Continuous-Integration-Scaling, Patch-For-Review: contint-admins can't start/stop nodepool (lack sudo) - https://phabricator.wikimedia.org/T111374#1639253 (Dzahn) 14:17 < mutante> !log labnodepool1001 - re-enable puppet and nodepool  14:19 < mutante> hashar: %contint...
[21:19:13] <wikibugs>	 Ops-Access-Requests, operations, Continuous-Integration-Scaling, Patch-For-Review: contint-admins can't start/stop nodepool (lack sudo) - https://phabricator.wikimedia.org/T111374#1639254 (Dzahn) Open>Resolved
[21:20:51] <wikibugs>	 Ops-Access-Requests, operations, Continuous-Integration-Scaling: contint-admins can't start/stop nodepool (lack sudo) - https://phabricator.wikimedia.org/T111374#1602573 (Dzahn)
[21:23:23] <wikibugs>	 Ops-Access-Requests, operations, Continuous-Integration-Scaling: contint-admins can't start/stop nodepool (lack sudo) - https://phabricator.wikimedia.org/T111374#1639264 (hashar) Thanks.  Wrote some lame notes on the wiki https://wikitech.wikimedia.org/w/index.php?title=Nodepool&diff=177626&oldid=177201
[21:24:35] <hashar>	 sleeep time
[21:26:03] <mutante>	 hashar: bonne nuit
[21:32:55] <grrrit-wm>	 (PS2) Yuvipanda: aptly: Pin per-project aptly repository [puppet] - https://gerrit.wikimedia.org/r/238201
[21:33:01] <grrrit-wm>	 (CR) Yuvipanda: [C: 2 V: 2] aptly: Pin per-project aptly repository [puppet] - https://gerrit.wikimedia.org/r/238201 (owner: Yuvipanda)
[21:34:14] <wikibugs>	 operations, Phabricator, Database, Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1639306 (mmodell) we wouldn't be the only ones using swift: https://secure.phabricator.com/T5843
[21:43:27] <grrrit-wm>	 (PS1) Dzahn: admin: create shell account for Joshua Minor [puppet] - https://gerrit.wikimedia.org/r/238333 (https://phabricator.wikimedia.org/T111872)
[21:44:12] <grrrit-wm>	 (CR) Dzahn: [C: -2] "needs labs user for UID" [puppet] - https://gerrit.wikimedia.org/r/238333 (https://phabricator.wikimedia.org/T111872) (owner: Dzahn)
[21:48:34] <grrrit-wm>	 (PS4) Yuvipanda: Tools: Migrate from labsdebrepo to aptly [puppet] - https://gerrit.wikimedia.org/r/238089 (https://phabricator.wikimedia.org/T111708) (owner: Tim Landscheidt)
[21:49:12] <grrrit-wm>	 (CR) Yuvipanda: [C: 2 V: 2] Tools: Migrate from labsdebrepo to aptly [puppet] - https://gerrit.wikimedia.org/r/238089 (https://phabricator.wikimedia.org/T111708) (owner: Tim Landscheidt)
[21:49:51] <grrrit-wm>	 (PS6) EBernhardson: Disable dynamic scripting in Elasticsearch [puppet] - https://gerrit.wikimedia.org/r/224651 (owner: Manybubbles)
[21:52:39] <wikibugs>	 Ops-Access-Requests, operations, Patch-For-Review: Requesting access to stat1003,  stat1002 and bast1001 for JMinor - https://phabricator.wikimedia.org/T111872#1639372 (Dzahn) @JMinor Hi, i'm Daniel, i'm going to follow-up with this ticket to get you the access now that all requirements are done.  Just...
[21:58:19] <wikibugs>	 Ops-Access-Requests, operations, Continuous-Integration-Infrastructure, Patch-For-Review: Let contint-admins force run puppet with /usr/local/sbin/puppet-run - https://phabricator.wikimedia.org/T110943#1639409 (Dzahn) @Robh did we have an outcome today?
[21:59:51] <wikibugs>	 Ops-Access-Requests, operations, Continuous-Integration-Infrastructure, Patch-For-Review: Let contint-admins force run puppet with /usr/local/sbin/puppet-run - https://phabricator.wikimedia.org/T110943#1639413 (RobH) Someone said it was approved in the meeting notes, but since I wasn't on clinic du...
[22:01:19] <wikibugs>	 Ops-Access-Requests, operations, Continuous-Integration-Infrastructure, Patch-For-Review: Let contint-admins force run puppet with /usr/local/sbin/puppet-run - https://phabricator.wikimedia.org/T110943#1639416 (Dzahn) Ok, thanks, i'll take it then since i'm on duty and just did the other contint-ad...
[22:01:49] <robh>	 mutante: wait
[22:01:51] <robh>	 i may be incorrect
[22:01:55] <robh>	 i just went to doublecheck what i said
[22:02:00] <robh>	 and i may be quickly reverting.
[22:02:47] <mutante>	 robh: ok
[22:02:55] <robh>	 it looks like it wasnt on the meeting this week
[22:03:02] <robh>	 so whoever was on clinic last week missed it i suppose
[22:03:49] <mutante>	 heh, hashar said ". So should be talked about again in the next ops meeting on Monday Sep. 25th."
[22:03:50] <wikibugs>	 Ops-Access-Requests, operations, Continuous-Integration-Infrastructure, Patch-For-Review: Let contint-admins force run puppet with /usr/local/sbin/puppet-run - https://phabricator.wikimedia.org/T110943#1639418 (RobH) I was incorrect.  That was a different task  (Checking notes on https://office.wik...
[22:03:55] <mutante>	 but that's a Friday :)
[22:04:08] <robh>	 why is it assigned to me?
[22:04:16] <mutante>	 you took it :)
[22:04:22] <wikibugs>	 Ops-Access-Requests, operations, Continuous-Integration-Infrastructure, Patch-For-Review: Let contint-admins force run puppet with /usr/local/sbin/puppet-run - https://phabricator.wikimedia.org/T110943#1639419 (RobH) a:RobH>None
[22:04:28] <mutante>	 "If this is approved in the meeting, I'll merge the patchset post meeting."
[22:04:31] <robh>	 oh well, putting back up for grabs, heh
[22:04:39] <robh>	 yes, but then the meeting didn't happen so oops =]
[22:05:37] <mutante>	 ok
[22:05:46] <mutante>	 also, Sep 25th must be a mistake
[22:06:25] <JohnFLewis>	 mutante: it's a Friday. Of course it is a mistake
[22:06:35] <JohnFLewis>	 [unaware of the context]
[22:06:46] <mutante>	 15:06 < mutante> but that's a Friday :)
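The weekday claim above is easy to verify. A minimal sketch, assuming the log date is 2015-09-14 (taken from the deploy calendar item and syslog lines elsewhere in this log); `DAY_NAMES` and `day_name` are hypothetical helper names used only for illustration:

```python
from datetime import date

# Fixed English names so the result does not depend on the system locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def day_name(d):
    """Return the English weekday name for a date (weekday() gives Monday=0)."""
    return DAY_NAMES[d.weekday()]


print(day_name(date(2015, 9, 14)))  # day of this log -> Monday
print(day_name(date(2015, 9, 25)))  # date quoted from the ticket -> Friday
```

So a "Monday" ops meeting cannot fall on Sep. 25th, 2015, which is the mismatch mutante points out.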
[22:09:02] <wikibugs>	 Ops-Access-Requests, operations, Continuous-Integration-Infrastructure, Patch-For-Review: Let contint-admins force run puppet with /usr/local/sbin/puppet-run - https://phabricator.wikimedia.org/T110943#1639428 (Dzahn) a:Dzahn
[22:09:13] <wikibugs>	 operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1639430 (Platonides) >>! In T104735#1639062, @BBlack wrote: > But now we're off in the territory of human comfort levels, not software.  It's still meaningless for any real verification to populate non-exis...
[22:21:51] <wikibugs>	 Ops-Access-Requests, operations, Icinga: give John Lewis permissions to send commands in icinga - https://phabricator.wikimedia.org/T105229#1639464 (Dzahn) a:Dzahn
[22:22:38] <wikibugs>	 Ops-Access-Requests, operations, Patch-For-Review: Requesting access to stat1003,  stat1002 and bast1001 for JMinor - https://phabricator.wikimedia.org/T111872#1639467 (Dzahn) a:Dzahn>JMinor please assign the ticket back to me when you're done or have a reply. thank you
[22:23:30] <wikibugs>	 Ops-Access-Requests, operations, Continuous-Integration-Infrastructure, Patch-For-Review: Let contint-admins force run puppet with /usr/local/sbin/puppet-run - https://phabricator.wikimedia.org/T110943#1639470 (Dzahn) Open>stalled
[22:26:07] <grrrit-wm>	 (CR) EBernhardson: "the issues applying this to beta cluster are unrelated to this patch. I've filed T112585 to capture that problem." [puppet] - https://gerrit.wikimedia.org/r/224651 (owner: Manybubbles)
[22:27:42] <wikibugs>	 Blocked-on-Operations, Puppet, Reading-Infrastructure-Team, Sentry, and 2 others: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1639478 (Dzahn) @tgr and @ori have comments on the patch originally created by @gilles. I'm not sure if all concerns have been addressed ye...
[22:29:24] <wikibugs>	 Blocked-on-Operations, Puppet, Reading-Infrastructure-Team, Sentry, and 2 others: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1639484 (Dzahn) also @akosiaris do you like it nowadays with the current PS?
[22:37:49] <yurik>	 !log deployed tilerator
[22:37:55] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:42:24] <grrrit-wm>	 (PS1) Dzahn: reprepro: add new distro jessie for mediawiki releases [puppet] - https://gerrit.wikimedia.org/r/238348 (https://phabricator.wikimedia.org/T111225)
[22:43:44] <grrrit-wm>	 (CR) John F. Lewis: [C: 1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/238348 (https://phabricator.wikimedia.org/T111225) (owner: Dzahn)
[22:44:30] <grrrit-wm>	 (CR) Dzahn: "what about "'AlsoAcceptFor' => 'trusty'," ?" [puppet] - https://gerrit.wikimedia.org/r/238348 (https://phabricator.wikimedia.org/T111225) (owner: Dzahn)
[22:46:23] <grrrit-wm>	 (CR) Dzahn: "@filippo am i doing it right? what else do we need?" [puppet] - https://gerrit.wikimedia.org/r/238348 (https://phabricator.wikimedia.org/T111225) (owner: Dzahn)
[22:47:31] <grrrit-wm>	 (CR) John F. Lewis: reprepro: add new distro jessie for mediawiki releases [puppet] - https://gerrit.wikimedia.org/r/238348 (https://phabricator.wikimedia.org/T111225) (owner: Dzahn)
[22:48:02] <wikibugs>	 operations, Patch-For-Review: Change distribution in releases.wikimedia.org to "sid" or "jessie" - https://phabricator.wikimedia.org/T111225#1639576 (Dzahn) @fgiunchedi ^ how does that change look to add jessie? I'm not sure what to put in "AlsoAcceptFor" and if we say 8.0, 8.1 or 8.2 jessie
[22:48:35] <grrrit-wm>	 (CR) John F. Lewis: "Looks like it should be jessie but since I'm not confident and missed it first time, removing code review." [puppet] - https://gerrit.wikimedia.org/r/238348 (https://phabricator.wikimedia.org/T111225) (owner: Dzahn)
[22:49:37] <grrrit-wm>	 (PS2) Dzahn: reprepro: add new distro jessie for mediawiki releases [puppet] - https://gerrit.wikimedia.org/r/238348 (https://phabricator.wikimedia.org/T111225)
[22:51:35] <wikibugs>	 operations, Flow, MediaWiki-Redirects, Collaboration-Team-Current, and 2 others: On mobile, the Flow notification's link takes you to the desktop version of the Flow page, even though the main (background) link takes you to the mobile one (main) - https://phabricator.wikimedia.org/T107108#1639601 (D...
[22:57:55] <mutante>	 andrewbogott: can we start "nova-scheduler" on labcontrol1002 or should it be stopped
[22:59:37] * James_F waves for SWAT, pre-empting the bot.
[22:59:40] <andrewbogott>	 mutante: labcontrol1002 is a hot spare, it shouldn’t really be running anything.  Is there a test for nova-scheduler?
[23:00:04] <jouncebot>	 RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150914T2300). Please do the needful.
[23:00:04] <jouncebot>	 James_F: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:13] <wikibugs>	 operations: (www.)wmfusercontent.org should respond to HTTP - https://phabricator.wikimedia.org/T104735#1639636 (BBlack) >>! In T104735#1639430, @Platonides wrote: >>>! In T104735#1639062, @BBlack wrote: >> But now we're off in the territory of human comfort levels, not software.  It's still meaningless for a...
[23:00:16] <James_F>	 Krenair: is it going to be you?
[23:00:29] <RoanKattouw>	 not it
[23:00:34] * RoanKattouw should go to sleep
[23:00:56] <mutante>	 andrewbogott: yes, i was checking icinga
[23:01:15] <andrewbogott>	 mutante: hm, ok, that probably shouldn’t be tested, I need to purge that host
[23:01:20] <mutante>	 it started about 7h ago fwiw
[23:01:43] <mutante>	 ok, thanks, i'll just disable it for now
[23:01:50] <subbu>	 RoanKattouw, are you in NL?
[23:01:52] <mutante>	 then you can purge it later
[23:02:07] <Krenair>	 James_F, guess so
[23:02:57] <andrewbogott>	 mutante: thanks
[23:03:43] <icinga-wm>	 ACKNOWLEDGEMENT - nova-scheduler process on labcontrol1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-scheduler daniel_zahn hot spare
[23:03:45] <mutante>	 added "persistent comment" -> "hot spare"
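For readers unfamiliar with this kind of alert: the check counts processes whose argument string matches a regex and goes CRITICAL when none match, which is expected on a hot spare. A rough sketch of that logic follows; it is not the actual Nagios plugin, `check_procs` here is a hypothetical Python stand-in, and both process tables are invented sample data.

```python
import re


def check_procs(arg_lines, pattern, critical_below=1):
    """Count command lines matching `pattern` and format a status line.

    Imitates the wording of the alerts in this log; illustrative only.
    """
    regex = re.compile(pattern)
    count = sum(1 for line in arg_lines if regex.search(line))
    noun = "process" if count == 1 else "processes"
    state = "OK" if count >= critical_below else "CRITICAL"
    return f"PROCS {state}: {count} {noun} with regex args {pattern}"


PATTERN = r"^/usr/bin/python /usr/bin/nova-scheduler"

# Invented process table for a hot spare where the scheduler is stopped.
stopped = ["/usr/sbin/sshd -D", "/usr/bin/python /usr/bin/nova-api"]
# Invented process table after the scheduler has been started.
running = stopped + ["/usr/bin/python /usr/bin/nova-scheduler"]

print(check_procs(stopped, PATTERN))
print(check_procs(running, PATTERN))
```

Because the pattern is anchored with `^`, only processes whose command line starts with the interpreter path followed by the scheduler path are counted.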
[23:05:39] <RoanKattouw>	 subbu: Yeah, for a week
[23:05:47] <ebernhardson>	 James_F: i'll push it out i suppose
[23:05:50] <ebernhardson>	 not seeing volunteers :)
[23:06:22] <James_F>	 ebernhardson: You and Krenair fight. :-)
[23:06:27] <ebernhardson>	 ahh i see krenair got it, excellent.
[23:08:59] <wikibugs>	 operations, Traffic, HTTPS: Track/notify cert expiries better - https://phabricator.wikimedia.org/T112521#1639696 (Dzahn) p:Triage>Normal
[23:09:58] <wikibugs>	 operations, ops-eqiad: Swap two elasticsearch servers in row D with an elasticsearch server in racks A3 and C5. - https://phabricator.wikimedia.org/T112559#1639700 (Dzahn) p:Triage>Normal
[23:11:13] <Krenair>	 James_F, why are you naming MediaWiki-Gallery without the prefix?
[23:12:13] <James_F>	 Krenair: The prefix isn't meant to be there.
[23:12:22] <Krenair>	 Why not?
[23:12:23] <James_F>	 Krenair: It's an artefact of the Bugzilla->Phabricator move.
[23:12:35] <James_F>	 It wasn't removed initially in case of clashes.
[23:13:14] <Krenair>	 Why are we not keeping them?
[23:13:30] <James_F>	 Because they clutter the search space and are unhelpful.
[23:13:40] <James_F>	 Anyway, this is off-topic for -operations. :-)
[23:13:50] <Krenair>	 Phabricator doesn't provide a hierarchy of projects
[23:14:19] <James_F>	 Yet.
[23:14:24] <Krenair>	 So it should be prefixed as a part of MediaWiki
[23:14:27] <James_F>	 No.
[23:14:51] <James_F>	 This isn't the venue, and it's settled policy. If you want to change it, you can start a new discussion, but people probably will disagree. :-)
[23:14:57] <Krenair>	 Link please
[23:15:22] <James_F>	 Krenair: Are you deploying or not?
[23:15:43] <wikibugs>	 Ops-Access-Requests, operations: Requesting access to elasticsearch-roots - https://phabricator.wikimedia.org/T111473#1639814 (Dzahn)
[23:15:57] <James_F>	 I've got stuff to do, like fixing Commons. :-)
[23:16:08] <Krenair>	 Send me the link James_F.
[23:16:17] <James_F>	 https://wikitech.wikimedia.org/wiki/Deployments
[23:18:27] <Krenair>	 That's not the link I was looking for
[23:18:55] <James_F>	 Krenair: Do I need to find someone else to deploy?
[23:19:09] <Krenair>	 Depends
[23:19:16] * James_F sighs.
[23:19:17] <James_F>	 ebernhardson: Can you please deploy? Krenair seems to have lost interest and we've got wikis to fix.
[23:19:31] <ebernhardson>	 what is this patch?
[23:19:40] <James_F>	 It's already live in master.
[23:19:46] <James_F>	 https://gerrit.wikimedia.org/r/238334
[23:20:08] <James_F>	 (Fixes an issue with the UploadWizard, which is unhelpful for Commonists during WLM. :-))
[23:20:39] <grrrit-wm>	 (CR) John F. Lewis: [C: -1] "Looked at the ticket as I don't see any long term need for this personally. Though I don't object to an amend and re-evaluate." [puppet] - https://gerrit.wikimedia.org/r/237865 (https://phabricator.wikimedia.org/T83158) (owner: Dzahn)
[23:20:47] <ebernhardson>	 i can't see any reason why not
[23:23:19] <logmsgbot>	 !log ebernhardson@tin Synchronized php-1.26wmf22/extensions/UploadWizard/: Swat out badtoken fix to UploadWizard in 1.26wmf22 (duration: 00m 12s)
[23:23:25] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:23:25] <James_F>	 Thanks.
[23:23:27] <JohnFLewis>	 James_F: I misread that for Communists. whoops :)
[23:23:29] * James_F tests.
[23:23:32] <James_F>	 JohnFLewis: ;-)
[23:24:30] <grrrit-wm>	 (PS1) CSteipp: Enable captchas on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/238357 (https://phabricator.wikimedia.org/T86460)
[23:25:06] <James_F>	 ebernhardson: Thanks, LGTM.
[23:27:29] <logmsgbot>	 !log ebernhardson@tin Synchronized php-1.26wmf22/extensions/WikimediaEvents/: Change bucket selection methods in CompletionSuggestions AB test (duration: 00m 12s)
[23:27:34] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:27:57] <wikibugs>	 operations, HHVM: /var/cache/hhvm/cli.hhbc.sq3 owned by root on some mw hosts - https://phabricator.wikimedia.org/T112517#1639855 (Dzahn) p:Triage>Normal
[23:28:47] <wikibugs>	 operations, ops-eqiad: db1043 degraded RAID - https://phabricator.wikimedia.org/T112502#1639859 (Dzahn) p:Triage>High
[23:29:20] <wikibugs>	 operations, Traffic: Deprecate pybal SSH health checks - https://phabricator.wikimedia.org/T111899#1639868 (Dzahn) p:Triage>Normal
[23:36:45] <icinga-wm>	 PROBLEM - citoid endpoints health on sca1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:37:52] <grrrit-wm>	 (CR) EBernhardson: [C: 1] Disable dynamic scripting in Elasticsearch [puppet] - https://gerrit.wikimedia.org/r/224651 (owner: Manybubbles)
[23:39:48] <mutante>	 sca1002 - NRPE running and All endpoints are healthy
[23:40:36] <icinga-wm>	 PROBLEM - citoid endpoints health on sca1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:40:46] <James_F>	 Oy.
[23:41:13] <James_F>	 mutante: Is the test too sensitive? Or is the service actually flaky?
[23:41:35] <icinga-wm>	 RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy
[23:42:07] <icinga-wm>	 RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy
[23:42:09] <mutante>	 James_F: the service looks ok:
[23:42:14] <James_F>	 Yeah.
[23:42:24] <mutante>	  /usr/local/lib/nagios/plugins/service_checker -t 5 10.64.32.153 http://10.64.32.153:1970
[23:42:30] <mutante>	 All endpoints are healthy
[23:42:32] <mutante>	 that is it
[23:42:47] <mutante>	 i believe it's icinga being too busy for a moment to get the result from NRPE within the timeout
[23:43:52] <mutante>	 to run it locally takes about 1.5 seconds, not 10
[23:44:11] <mutante>	 but on the icinga side, there's lots and lots to run 
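mutante's diagnosis rests on the timeout being enforced by the caller: a check that is quick on the host still alerts if the caller does not get an answer inside its window. A minimal sketch of that behaviour, assuming nothing about the real check_nrpe internals; `run_check` and the two stand-in commands are hypothetical, and the timeout message only imitates the alert text seen above.

```python
import subprocess
import sys
import time


def run_check(cmd, timeout_s):
    """Run a check command; return (status_line, elapsed_seconds).

    The timeout is applied on the caller's side, so slowness anywhere
    between caller and check counts against it.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
        status = proc.stdout.strip()
    except subprocess.TimeoutExpired:
        status = f"CHECK_NRPE: Socket timeout after {timeout_s} seconds."
    return status, time.monotonic() - start


# Stand-in for a check that answers quickly.
fast = [sys.executable, "-c", "print('All endpoints are healthy')"]
# Stand-in for a reply that arrives too late for the caller.
slow = [sys.executable, "-c", "import time; time.sleep(5)"]

print(run_check(fast, timeout_s=10)[0])
print(run_check(slow, timeout_s=1)[0])
```

Comparing the elapsed time of a local run against the caller's timeout, as mutante does with 1.5 versus 10 seconds, is what separates a slow service from a busy monitoring host.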
[23:44:51] <wikibugs>	 operations, Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1639913 (Tgr)
[23:45:39] <wikibugs>	 operations, Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1634588 (Tgr)
[23:46:09] <grrrit-wm>	 (PS3) Dzahn: Add scap scripts to all canary app servers [puppet] - https://gerrit.wikimedia.org/r/237707 (https://phabricator.wikimedia.org/T112174) (owner: Jcrespo)
[23:47:23] * James_F nods. Thanks, mutante.
[23:48:04] <grrrit-wm>	 (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/883/mw1017.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/237707 (https://phabricator.wikimedia.org/T112174) (owner: Jcrespo)
[23:48:49] <wikibugs>	 operations, Patch-For-Review: mw1017 has outdated broken mwscript - https://phabricator.wikimedia.org/T112174#1639947 (Dzahn) a:jcrespo>Dzahn
[23:50:01] <grrrit-wm>	 (PS34) Gergő Tisza: Basic role for Sentry [puppet] - https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: Gilles)
[23:51:16] <icinga-wm>	 RECOVERY - nova-scheduler process on labcontrol1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-scheduler
[23:54:04] <wikibugs>	 operations, Patch-For-Review: mw1017 has outdated broken mwscript - https://phabricator.wikimedia.org/T112174#1639967 (Dzahn) Sep 14 23:50:10 mw1017 puppet-agent[19262]: (/Stage[main]/Scap::Scripts/File[/usr/local/bin/mwscript]/content) content changed   ``` Notice: /Stage[main]/Scap::Scripts/File[/usr/loc...
[23:55:00] <wikibugs>	 operations, Patch-For-Review: mw1017 has outdated broken mwscript - https://phabricator.wikimedia.org/T112174#1639968 (Dzahn) Open>Resolved
[23:55:10] <wikibugs>	 operations: mw1017 has outdated broken mwscript - https://phabricator.wikimedia.org/T112174#1628038 (Dzahn)
[23:56:16] <icinga-wm>	 PROBLEM - nova-scheduler process on labcontrol1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-scheduler
[23:57:11] <grrrit-wm>	 (PS1) Tim Starling: Update personal .bashrc [puppet] - https://gerrit.wikimedia.org/r/238363