[00:14:13] PROBLEM - MySQL Processlist on db1040 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 0 copy to table, 100 statistics
[00:17:54] RECOVERY - MySQL Processlist on db1040 is OK: OK 1 unauthenticated, 0 locked, 0 copy to table, 3 statistics
[00:23:32] PROBLEM - puppet last run on mw1206 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:48:42] RECOVERY - puppet last run on mw1206 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[02:25:20] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.20) (duration: 11m 22s)
[02:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:33:56] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Apr 11 02:33:52 UTC 2016 (duration 8m 32s)
[02:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:23:32] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 65.52% of data above the critical threshold [5000000.0]
[04:23:43] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 65.52% of data above the critical threshold [5000000.0]
[04:34:58] !log ran `mwscript extensions/CentralAuth/maintenance/createLocalAccount.php --wiki=enwiki 'Corvin Victor Paul'` due to CA bug
[04:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:42:57] !log mwscript deleteEqualMessages.php --wiki shwiki (T45917)
[04:42:58] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917
[04:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:09:32] PROBLEM - puppet last run on db2036 is CRITICAL: CRITICAL: puppet fail
[05:10:43] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[05:14:24] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[05:38:23] RECOVERY - puppet last run on db2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:23:34] Puppet, Beta-Cluster-Infrastructure: deployment-redis0[12], deployment-sentry2 puppet failures due to missing hiera data redis::shards - https://phabricator.wikimedia.org/T132263#2194207 (Joe) p:Triage>Low
[06:23:44] Puppet, Beta-Cluster-Infrastructure: deployment-redis0[12], deployment-sentry2 puppet failures due to missing hiera data redis::shards - https://phabricator.wikimedia.org/T132263#2192922 (Joe) a:Joe>None
[06:24:14] Puppet, Beta-Cluster-Infrastructure: deployment-redis0[12], deployment-sentry2 puppet failures due to missing hiera data redis::shards - https://phabricator.wikimedia.org/T132263#2192922 (Joe) @Krenair please stop assigning me tickets instead of subscribing me to those
[06:24:34] Puppet, Beta-Cluster-Infrastructure: deployment-memc0[2-4] puppet failure due to missing hiera data mediawiki::redis_servers::codfw - https://phabricator.wikimedia.org/T132262#2194211 (Joe) p:Triage>Lowest a:Joe>None
[06:24:52] Puppet, Beta-Cluster-Infrastructure: deployment-memc0[2-4] puppet failure due to missing hiera data mediawiki::redis_servers::codfw - https://phabricator.wikimedia.org/T132262#2192909 (Joe) Again, stop assigning vs subscribing people to tickets.
[06:30:53] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:02] PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:03] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:03] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:04] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: puppet fail
[06:31:34] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:43] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:22] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:57] (PS1) Elukey: Reduce cronspam caused by OCG [puppet] - https://gerrit.wikimedia.org/r/282650
[06:33:03] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:13] PROBLEM - puppet last run on restbase-test2003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:36:52] RECOVERY - puppet last run on restbase-test2003 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[06:48:43] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:48:53] (Abandoned) Muehlenhoff: Add ferm rules for statsite [puppet] - https://gerrit.wikimedia.org/r/282340 (owner: Muehlenhoff)
[06:49:09] (Abandoned) Muehlenhoff: Enable base::firewall on labmon1001 [puppet] - https://gerrit.wikimedia.org/r/282341 (owner: Muehlenhoff)
[06:56:43] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:57:12] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:57:23] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:57:43] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:23] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:32] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:33] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:33] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:33] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:02:14] (PS1) Elukey: Add complete submodule support to the puppet compiler. [software/puppet-compiler] - https://gerrit.wikimedia.org/r/282652 (https://phabricator.wikimedia.org/T132154)
[07:03:07] (CR) jenkins-bot: [V: -1] Add complete submodule support to the puppet compiler. [software/puppet-compiler] - https://gerrit.wikimedia.org/r/282652 (https://phabricator.wikimedia.org/T132154) (owner: Elukey)
[07:04:22] RECOVERY - HHVM rendering on mw1143 is OK: HTTP OK: HTTP/1.1 200 OK - 74372 bytes in 1.165 second response time
[07:04:42] RECOVERY - HHVM rendering on mw1201 is OK: HTTP OK: HTTP/1.1 200 OK - 74370 bytes in 0.248 second response time
[07:04:44] RECOVERY - Apache HTTP on mw1143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.066 second response time
[07:05:18] !log restarted hhvm on mw1143, mw1201 and mw1213
[07:05:22] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.042 second response time
[07:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:05:52] RECOVERY - Apache HTTP on mw1213 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.054 second response time
[07:06:29] (PS2) Elukey: Add complete submodule support to the puppet compiler. [software/puppet-compiler] - https://gerrit.wikimedia.org/r/282652 (https://phabricator.wikimedia.org/T132154)
[07:06:37] good morning to you Jenkins, always a pleasure
[07:07:02] RECOVERY - HHVM rendering on mw1213 is OK: HTTP OK: HTTP/1.1 200 OK - 74369 bytes in 0.126 second response time
[07:13:33] <_joe_> moritzm: oh did you restart those?
[07:13:52] <_joe_> I was looking at them and then went to run an errand, meh
[07:14:14] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[07:15:57] Operations, Beta-Cluster-Infrastructure, Performance: Need a way to simulate replication lag to test replag issues - https://phabricator.wikimedia.org/T40945#2194273 (Nikerabbit)
[07:19:07] yeah sorry, restarted all three of those
[07:22:27] <_joe_> moritzm: it's my fault, not yours
[07:22:43] <_joe_> there is a pretty strange new failure condition
[07:22:54] <_joe_> where there is a defunct subprocess of hhvm
[07:25:04] I ran hhvm-dump-debug on mw1143, at least the backtrace should still be there
[07:26:11] <_joe_> moritzm: cool, I'll take a look
[07:48:04] PROBLEM - puppet last run on mw1142 is CRITICAL: CRITICAL: Puppet has 2 failures
[07:49:22] PROBLEM - puppet last run on mw1091 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:49:53] PROBLEM - puppet last run on elastic1008 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:00:04] moritzm: Dear anthropoid, the time has come. Please deploy terbium maintenance (enable ferm) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160411T0800).
[08:01:25] Operations, Ops-Access-Requests: global root access for gilles - https://phabricator.wikimedia.org/T130910#2194295 (ori) @gilles, I intend to make a new request, either as an edit to this task or as a new task altogether. In anticipation of that, could you provide specific examples of the sort of things y...
[08:05:08] (PS1) Muehlenhoff: Enable base::firewall on terbium [puppet] - https://gerrit.wikimedia.org/r/282654
[08:06:33] (CR) Muehlenhoff: [C: 2 V: 2] Enable base::firewall on terbium [puppet] - https://gerrit.wikimedia.org/r/282654 (owner: Muehlenhoff)
[08:10:35] (PS1) Filippo Giunchedi: cassandra: add restbase1014-b [puppet] - https://gerrit.wikimedia.org/r/282655
[08:12:13] (CR) Filippo Giunchedi: [C: 2 V: 2] cassandra: add restbase1014-b [puppet] - https://gerrit.wikimedia.org/r/282655 (owner: Filippo Giunchedi)
[08:15:02] RECOVERY - puppet last run on mw1091 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[08:15:34] RECOVERY - puppet last run on mw1142 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:15:34] RECOVERY - puppet last run on elastic1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:18:58] !log Disabling Puppet on cluster mysql and parsercache to merge and test change 282385 on db2040, T111654
[08:18:59] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654
[08:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:19:19] !log repool restbase2002, depool restbase2005
[08:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:20:40] (PS2) Volans: MariaDB: allow multiple MySQL TLS configurations [puppet] - https://gerrit.wikimedia.org/r/282385 (https://phabricator.wikimedia.org/T111654)
[08:24:02] PROBLEM - MySQL Processlist on db1040 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 0 copy to table, 72 statistics
[08:24:11] again... checking
[08:26:52] it's recovering, investigating
[08:27:44] RECOVERY - MySQL Processlist on db1040 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 28 statistics
[08:28:18] !log start raid expansion on restbase2005 T127951
[08:28:19] T127951: expand raid0 in restbase200[1-6] - https://phabricator.wikimedia.org/T127951
[08:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:28:34] (CR) Volans: [C: 2] MariaDB: allow multiple MySQL TLS configurations [puppet] - https://gerrit.wikimedia.org/r/282385 (https://phabricator.wikimedia.org/T111654) (owner: Volans)
[08:30:48] !log bootstrap restbase1014-b T128107
[08:30:49] T128107: install restbase1010-restbase1015 - https://phabricator.wikimedia.org/T128107
[08:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:31:00] (PS3) Giuseppe Lavagetto: hhvm: hhvm-dump-debug compatibility with jessie [puppet] - https://gerrit.wikimedia.org/r/282358
[08:33:23] (CR) Giuseppe Lavagetto: [C: 2] hhvm: hhvm-dump-debug compatibility with jessie [puppet] - https://gerrit.wikimedia.org/r/282358 (owner: Giuseppe Lavagetto)
[08:34:16] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000000.0]
[08:34:52] (PS2) Elukey: Reduce cronspam caused by OCG [puppet] - https://gerrit.wikimedia.org/r/282650
[08:35:25] PROBLEM - cassandra-b CQL 10.64.48.136:9042 on restbase1014 is CRITICAL: Connection refused
[08:35:39] known ^ icinga race
[08:36:10] (PS1) Volans: Removed require that is not directly managed by Puppet [puppet/mariadb] - https://gerrit.wikimedia.org/r/282657 (https://phabricator.wikimedia.org/T111654)
[08:36:16] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.48.136:9042 on restbase1014 is CRITICAL: Connection refused Filippo Giunchedi bootstrap
[08:36:20] (CR) Elukey: [C: 2] Reduce cronspam caused by OCG [puppet] - https://gerrit.wikimedia.org/r/282650 (owner: Elukey)
[08:36:52] (CR) Volans: [C: 2] Removed require that is not directly managed by Puppet [puppet/mariadb] - https://gerrit.wikimedia.org/r/282657 (https://phabricator.wikimedia.org/T111654) (owner: Volans)
[08:38:35] PROBLEM - puppet last run on mw1148 is CRITICAL: CRITICAL: Puppet has 3 failures
[08:38:48] (PS1) Volans: MariaDB: updated submodule reference [puppet] - https://gerrit.wikimedia.org/r/282658 (https://phabricator.wikimedia.org/T111654)
[08:39:46] Operations, Performance: Package and deploy Mcrouter as a replacement for twemproxy - https://phabricator.wikimedia.org/T132317#2194315 (ori)
[08:40:37] (CR) Volans: [C: 2] MariaDB: updated submodule reference [puppet] - https://gerrit.wikimedia.org/r/282658 (https://phabricator.wikimedia.org/T111654) (owner: Volans)
[08:43:15] PROBLEM - Apache HTTP on mw1258 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.016 second response time
[08:44:22] !log Re-enabling Puppet on cluster mysql and parsercache to deploy change 282385, T111654
[08:44:23] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654
[08:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:45:04] RECOVERY - Apache HTTP on mw1258 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.038 second response time
[08:45:08] (PS1) Muehlenhoff: Enable base::firewall on rdb2001 [puppet] - https://gerrit.wikimedia.org/r/282659
[08:52:34] PROBLEM - puppet last run on mw2098 is CRITICAL: CRITICAL: puppet fail
[08:52:49] (PS2) Filippo Giunchedi: diamond: send production traffic via graphite line protocol [puppet] - https://gerrit.wikimedia.org/r/281622 (https://phabricator.wikimedia.org/T121861)
[08:53:02] (CR) Filippo Giunchedi: [C: 2 V: 2] diamond: send production traffic via graphite line protocol [puppet] - https://gerrit.wikimedia.org/r/281622 (https://phabricator.wikimedia.org/T121861) (owner: Filippo Giunchedi)
[09:05:26] PROBLEM - puppet last run on rdb2004 is CRITICAL: CRITICAL: puppet fail
[09:19:15] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[09:20:55] RECOVERY - puppet last run on mw2098 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[09:31:55] RECOVERY - puppet last run on rdb2004 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[09:38:13] (PS1) Muehlenhoff: New helper function for job handling (and related tests) [debs/debdeploy] - https://gerrit.wikimedia.org/r/282662
[09:39:51] (CR) Muehlenhoff: [C: 2 V: 2] New helper function for job handling (and related tests) [debs/debdeploy] - https://gerrit.wikimedia.org/r/282662 (owner: Muehlenhoff)
[09:49:41] (CR) Filippo Giunchedi: [C: 1] "IIRC it was mentioned on irc that it was dropping less packets than statsdlb" [puppet] - https://gerrit.wikimedia.org/r/282356 (https://phabricator.wikimedia.org/T126447) (owner: Faidon Liambotis)
[09:50:05] (CR) Filippo Giunchedi: [C: 1] Remove statsdlb, unreferenced now [puppet] - https://gerrit.wikimedia.org/r/282357 (owner: Faidon Liambotis)
[09:50:48] (CR) Filippo Giunchedi: [C: 1] cache: vary statsd_server with hiera [puppet] - https://gerrit.wikimedia.org/r/249490 (https://phabricator.wikimedia.org/T116898) (owner: Hashar)
[09:51:48] (CR) Filippo Giunchedi: [C: 1] Enable base::firewall on labmon1001 [puppet] - https://gerrit.wikimedia.org/r/282344 (owner: Muehlenhoff)
[09:53:49] (CR) Filippo Giunchedi: [C: 1] Add ferm rules for statsite [puppet] - https://gerrit.wikimedia.org/r/282343 (owner: Muehlenhoff)
[09:58:32] (PS2) Muehlenhoff: Add ferm rules for statsite [puppet] - https://gerrit.wikimedia.org/r/282343
[09:58:44] (CR) Muehlenhoff: [C: 2 V: 2] Add ferm rules for statsite [puppet] - https://gerrit.wikimedia.org/r/282343 (owner: Muehlenhoff)
[10:04:37] Operations, Analytics: kafkatee cronspam from oxygen - https://phabricator.wikimedia.org/T132322#2194403 (elukey)
[10:12:46] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 79 failures
[10:18:04] PROBLEM - HHVM rendering on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:18:34] PROBLEM - Apache HTTP on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:19:55] PROBLEM - SSH on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:20:04] PROBLEM - DPKG on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:20:04] PROBLEM - dhclient process on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:20:05] PROBLEM - Disk space on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:20:24] PROBLEM - nutcracker port on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:20:25] PROBLEM - HHVM processes on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:20:30] ^ mw1144 seems completely offline. Anyone knows what's happening? Can I help?
[10:20:35] PROBLEM - nutcracker process on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:20:58] gehel: I was looking at it too, I believe we'd need to powercycle
[10:21:24] PROBLEM - RAID on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:21:35] PROBLEM - Check size of conntrack table on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:21:35] PROBLEM - configured eth on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:21:38] elukey: no better prognosis from here. You take it?
[10:22:43] gehel: sure
[10:24:44] PROBLEM - salt-minion processes on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:24:56] getting permission denied for the mgmt ssh
[10:25:33] ah snap root@
[10:26:15] (CR) Filippo Giunchedi: [C: 1] Manage /etc/pam.d/sshd in role::bastionhost::2fa via puppet [puppet] - https://gerrit.wikimedia.org/r/282159 (owner: Muehlenhoff)
[10:27:07] !log powercycled mw1144.eqiad.wmnet
[10:27:09] o/ akosiaris
[10:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:29:04] RECOVERY - Check size of conntrack table on mw1144 is OK: OK: nf_conntrack is 0 % full
[10:29:04] RECOVERY - configured eth on mw1144 is OK: OK - interfaces up
[10:29:14] RECOVERY - SSH on mw1144 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0)
[10:29:15] RECOVERY - DPKG on mw1144 is OK: All packages OK
[10:29:24] RECOVERY - dhclient process on mw1144 is OK: PROCS OK: 0 processes with command name dhclient
[10:29:25] RECOVERY - Disk space on mw1144 is OK: DISK OK
[10:29:35] RECOVERY - nutcracker port on mw1144 is OK: TCP OK - 0.000 second response time on port 11212
[10:29:46] RECOVERY - HHVM processes on mw1144 is OK: PROCS OK: 6 processes with command name hhvm
[10:29:46] RECOVERY - Apache HTTP on mw1144 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 2.529 second response time
[10:29:56] RECOVERY - nutcracker process on mw1144 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[10:30:14] RECOVERY - salt-minion processes on mw1144 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:30:45] RECOVERY - RAID on mw1144 is OK: OK: no RAID installed
[10:31:14] RECOVERY - HHVM rendering on mw1144 is OK: HTTP OK: HTTP/1.1 200 OK - 74370 bytes in 0.480 second response time
[10:31:36] the oom killer had a party on mw1144
[10:33:44] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:40:54] elukey: thanks!
[10:42:18] (PS2) Giuseppe Lavagetto: redis: add monitoring define [puppet] - https://gerrit.wikimedia.org/r/282383
[10:42:24] (CR) Phuedx: [C: 1] "I haven't been able to test this as I can't get MediaWiki-Vagrant's varnish role to provision – I'm submitting a bug report now." [puppet] - https://gerrit.wikimedia.org/r/281031 (https://phabricator.wikimedia.org/T127021) (owner: Bmansurov)
[10:42:52] Operations: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2194449 (elukey)
[10:43:49] Operations, Analytics: kafkatee cronspam from oxygen - https://phabricator.wikimedia.org/T132322#2194462 (elukey)
[10:43:51] Operations, Monitoring: Refactor RAID checks (check-raid) - https://phabricator.wikimedia.org/T84050#2194465 (elukey)
[10:43:55] Operations, media-storage, Patch-For-Review: swift-dispersion-stats cronspam when disks are broken - https://phabricator.wikimedia.org/T78762#2194466 (elukey)
[10:43:57] Operations: Disable cron.standard checking for lost+found directories - https://phabricator.wikimedia.org/T1249#2194467 (elukey)
[10:43:59] Operations: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2194461 (elukey)
[10:45:06] Operations: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2194449 (elukey) p:Triage>Normal
[10:46:23] Operations: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2194474 (elukey)
[10:46:25] Operations: improve cron spam visibility - https://phabricator.wikimedia.org/T84845#2194475 (elukey)
[10:52:33] Operations: Weak digest algorithm (SHA1) used to sign InRelease on apt.wikimedia.org - https://phabricator.wikimedia.org/T132325#2194482 (ema)
[10:52:35] Operations, Ops-Access-Requests: global root access for gilles - https://phabricator.wikimedia.org/T130910#2194481 (Gilles) At the very least I would need to have access to the logs where the self.logger.* calls in rewrite.py end up, which would let me add more debugging information. rewrite.py has many c...
[10:52:37] Operations: improve cron spam visibility - https://phabricator.wikimedia.org/T84845#2194494 (fgiunchedi) also note that to test this easily we could either a dummy mailbox to receive cronspam and turn them into sentry messages, this way existing mails wouldn't be touched
[11:05:04] PROBLEM - HHVM rendering on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:05:56] PROBLEM - Apache HTTP on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:06:24] PROBLEM - DPKG on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:06:35] PROBLEM - Disk space on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:06:35] PROBLEM - nutcracker port on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:06:35] PROBLEM - dhclient process on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:06:45] PROBLEM - SSH on mw1148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:06:46] PROBLEM - configured eth on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:06:54] PROBLEM - HHVM processes on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:06:54] PROBLEM - nutcracker process on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:07:25] PROBLEM - RAID on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:07:35] PROBLEM - salt-minion processes on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:07:45] PROBLEM - Check size of conntrack table on mw1148 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:08:59] Operations, Graphite, Patch-For-Review: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141#2194504 (fgiunchedi)
[11:09:01] Operations, Monitoring, Patch-For-Review: switch diamond to use graphite line protocol - https://phabricator.wikimedia.org/T121861#2194502 (fgiunchedi) Open>Resolved this has been deployed, though the impact on statsd seems to have been minimal in terms of udp packets/drops, though the move also...
[11:09:42] Operations, DBA: Email spam from some MariaDB's logrotate - https://phabricator.wikimedia.org/T127638#2194506 (Volans)
[11:09:44] Operations: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2194505 (Volans)
[11:14:08] !log mw1148 powercycled, not responsive to ssh
[11:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:16:05] RECOVERY - nutcracker process on mw1148 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[11:16:34] RECOVERY - RAID on mw1148 is OK: OK: no RAID installed
[11:16:35] <_joe_> elukey: again oom?
[11:16:44] RECOVERY - salt-minion processes on mw1148 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[11:16:54] RECOVERY - Check size of conntrack table on mw1148 is OK: OK: nf_conntrack is 0 % full
[11:17:05] RECOVERY - Apache HTTP on mw1148 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 1.743 second response time
[11:17:25] RECOVERY - DPKG on mw1148 is OK: All packages OK
[11:17:44] RECOVERY - Disk space on mw1148 is OK: DISK OK
[11:17:44] RECOVERY - nutcracker port on mw1148 is OK: TCP OK - 0.000 second response time on port 11212
[11:17:44] RECOVERY - dhclient process on mw1148 is OK: PROCS OK: 0 processes with command name dhclient
[11:17:54] RECOVERY - SSH on mw1148 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0)
[11:17:54] RECOVERY - configured eth on mw1148 is OK: OK - interfaces up
[11:17:55] RECOVERY - HHVM rendering on mw1148 is OK: HTTP OK: HTTP/1.1 200 OK - 74370 bytes in 0.285 second response time
[11:17:55] RECOVERY - HHVM processes on mw1148 is OK: PROCS OK: 6 processes with command name hhvm
[11:18:01] Operations, Graphite: jobrunner should send statsd in batches - https://phabricator.wikimedia.org/T132327#2194525 (fgiunchedi)
[11:18:23] _joe_: I was about to check
[11:18:42] Operations, Graphite, Patch-For-Review: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141#1330890 (fgiunchedi)
[11:18:44] Operations, MediaWiki-General-or-Unknown, Graphite, MW-1.27-release-notes, Patch-For-Review: mediawiki should send statsd metrics in batches - https://phabricator.wikimedia.org/T116031#1738764 (fgiunchedi) Open>Resolved see {T132327} for jobrunner, mediawiki is done
[11:20:14] RECOVERY - puppet last run on mw1148 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[11:22:26] Operations, Dumps-Generation, HHVM, Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#2194552 (ArielGlenn) Some investigation from the command line shows me that these messages for dumps are only produced when there's a Segfault causing the p...
[11:24:24] _joe_ https://grafana.wikimedia.org/dashboard/db/server-board?panelId=14&fullscreen (mw1148) looks weird, steady increase. I can't see now anything useful on dmesg/logs
[11:25:30] <_joe_> elukey: that is called "HHVM has a memory leak"
[11:27:04] _joe_ all right thanks :D
[11:27:53] <_joe_> :/
[11:39:27] (PS1) Elukey: This is a test for the puppet compiler, not meant to be committed. [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/282670
[11:47:02] (CR) Gehel: [C: 1] Improve robustness of es-tool [puppet] - https://gerrit.wikimedia.org/r/282472 (https://phabricator.wikimedia.org/T128786) (owner: Adedommelin)
[11:48:26] PROBLEM - puppet last run on db2059 is CRITICAL: CRITICAL: puppet fail
[11:53:21] Operations, Traffic, Wikimedia-Blog, HTTPS: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2194614 (Tbayer) >>! In T105905#2191112, @jrbs wrote: > [Wordpress is now using HTTPS as default for blogs making use of their backend.](http://thenextweb.com/insider/2016/04/08/wordpre...
[12:07:29] Operations, Monitoring: Refactor RAID checks (check-raid) - https://phabricator.wikimedia.org/T84050#2194625 (Jgreen) >>! In T84050#1467111, @fgiunchedi wrote: > I did ask the same question to Jeff without remembering this ticket, anyways for reference I'm attaching it {F203780} Here's the whole frack mo...
[12:09:16] (PS1) Alexandros Kosiaris: Introduce meitnerium.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/282674 (https://phabricator.wikimedia.org/T131358)
[12:13:08] (PS1) Alexandros Kosiaris: Introduce meitnerium.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/282676 (https://bugzilla.wikimedia.org/131358)
[12:14:44] RECOVERY - puppet last run on db2059 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[12:14:47] (CR) Alexandros Kosiaris: [C: 2] Introduce meitnerium.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/282674 (https://phabricator.wikimedia.org/T131358) (owner: Alexandros Kosiaris)
[12:28:54] (PS1) Muehlenhoff: Add code to handle legacy binaries [debs/debdeploy] - https://gerrit.wikimedia.org/r/282680
[12:31:29] Operations, Graphite, Patch-For-Review: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141#2194657 (fgiunchedi) I took a one-minute survey of statsd traffic to get top users ```lines=5 graphite1001:~$ sudo timeout 1m ngrep -q -W byline . udp dst port 8125 | gre...
[12:36:55] (CR) Gehel: "We need to increase discovery.zen.minimum_master_nodes to 4 if we want to deploy this." [puppet] - https://gerrit.wikimedia.org/r/251024 (https://phabricator.wikimedia.org/T112556) (owner: EBernhardson)
[12:37:00] (CR) Muehlenhoff: [C: 2 V: 2] Add code to handle legacy binaries [debs/debdeploy] - https://gerrit.wikimedia.org/r/282680 (owner: Muehlenhoff)
[12:40:21] (CR) Gehel: "I now see https://gerrit.wikimedia.org/r/#/c/251025/, the goal was not to increase the number of masters, but to replace the old by new on" [puppet] - https://gerrit.wikimedia.org/r/251024 (https://phabricator.wikimedia.org/T112556) (owner: EBernhardson)
[12:43:58] (PS2) Muehlenhoff: Enable base::firewall on labmon1001 [puppet] - https://gerrit.wikimedia.org/r/282344
[12:48:23] (PS1) Volans: MariaDB: use Puppet certs for s5 cross-dc replica [puppet] - https://gerrit.wikimedia.org/r/282684 (https://phabricator.wikimedia.org/T111654)
[12:56:51] Operations, Dumps-Generation, HHVM, Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#2194702 (ArielGlenn) I suspect a bad interaction with hhvm/bzip2 (compress streams)/XMLReader but have yet to be able to isolate it. Still investigating.
[12:59:33] Operations, CirrusSearch, Discovery, Discovery-Search-Backlog, Patch-For-Review: Only use newer (elastic10{16..31}) servers as master capable elasticsearch nodes - https://phabricator.wikimedia.org/T112556#2194704 (Gehel) According to racktables, we have new elasticsearch hardware in row A and D...
[13:00:13] Operations, Discovery, hardware-requests: Refresh elastic10{01..16}.eqiad.wmnet servers - https://phabricator.wikimedia.org/T128000#2060818 (Gehel)
[13:00:15] Operations, CirrusSearch, Discovery, Discovery-Search-Backlog, Patch-For-Review: Only use newer (elastic10{16..31}) servers as master capable elasticsearch nodes - https://phabricator.wikimedia.org/T112556#2194708 (Gehel)
[13:00:56] (CR) Gehel: [C: -1] "Servers do not seem to be in the correct row, let's wait until this is sorted out to merge this." [puppet] - https://gerrit.wikimedia.org/r/251024 (https://phabricator.wikimedia.org/T112556) (owner: EBernhardson)
[13:01:06] (PS1) Muehlenhoff: Fix argument parsing [debs/debdeploy] - https://gerrit.wikimedia.org/r/282686
[13:01:17] (CR) Muehlenhoff: [C: 2 V: 2] Fix argument parsing [debs/debdeploy] - https://gerrit.wikimedia.org/r/282686 (owner: Muehlenhoff)
[13:05:45] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [5000000.0]
[13:07:55] (CR) EBernhardson: "depending on cluster load just retrying might not always work. I havn't experienced it with the settings, but other cluster state changes" [puppet] - https://gerrit.wikimedia.org/r/282472 (https://phabricator.wikimedia.org/T128786) (owner: Adedommelin)
[13:08:25] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 55.17% of data above the critical threshold [5000000.0]
[13:13:08] Operations, Discovery, hardware-requests: Refresh elastic10{01..16}.eqiad.wmnet servers - https://phabricator.wikimedia.org/T128000#2194724 (Gehel) Note: we want to have 3 elasticsearch node to be master eligible and we'd like them to be in different rows (for obvious reasons). @Gehel will check with @...
[13:19:08] (03CR) 10Volans: "changes looks good: https://puppet-compiler.wmflabs.org/2390/" [puppet] - 10https://gerrit.wikimedia.org/r/282684 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [13:19:35] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0] [13:20:35] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [13:21:15] 6Operations, 10Wikimedia-Apache-configuration: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#2194732 (10jrbs) [13:29:42] (03PS2) 10Alexandros Kosiaris: Introduce meitnerium.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/282676 (https://bugzilla.wikimedia.org/131358) [13:33:52] 6Operations, 10Monitoring, 7Graphite, 7HHVM, 13Patch-For-Review: check_graphite - "UNKNOWN: More than half of the datapoints are undefined " - https://phabricator.wikimedia.org/T105218#2194770 (10fgiunchedi) 5Open>3Resolved ATM there's two outstanding UNKNOWN with "no valid datapoints found", all act... 
[13:35:21] (03CR) 10Filippo Giunchedi: [C: 031] Enable base::firewall on labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/282344 (owner: 10Muehlenhoff) [13:37:58] (03PS3) 10Andrew Bogott: In our libvirt hack, rename libvirt_images_type to images_type [puppet] - 10https://gerrit.wikimedia.org/r/281683 (https://phabricator.wikimedia.org/T131322) [13:45:01] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable base::firewall on labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/282344 (owner: 10Muehlenhoff) [13:46:25] PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:26] (03PS1) 10Faidon Liambotis: redirects: update stopsurveillance target URL [puppet] - 10https://gerrit.wikimedia.org/r/282692 (https://phabricator.wikimedia.org/T97341) [13:46:35] PROBLEM - HHVM rendering on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:45] (03PS2) 10Faidon Liambotis: redirects: update stopsurveillance target URL [puppet] - 10https://gerrit.wikimedia.org/r/282692 (https://phabricator.wikimedia.org/T97341) [13:47:49] (03CR) 10Faidon Liambotis: [C: 032 V: 032] redirects: update stopsurveillance target URL [puppet] - 10https://gerrit.wikimedia.org/r/282692 (https://phabricator.wikimedia.org/T97341) (owner: 10Faidon Liambotis) [13:48:34] PROBLEM - dhclient process on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:48:45] PROBLEM - nutcracker process on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:48:45] PROBLEM - RAID on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:48:46] PROBLEM - Check size of conntrack table on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:48:54] PROBLEM - puppet last run on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:48:55] PROBLEM - salt-minion processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:49:03] (03PS4) 10Andrew Bogott: In our libvirt hack, rename libvirt_images_type to images_type [puppet] - 10https://gerrit.wikimedia.org/r/281683 (https://phabricator.wikimedia.org/T131322) [13:49:14] PROBLEM - configured eth on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:49:16] PROBLEM - DPKG on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:49:24] PROBLEM - SSH on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:50:06] another OOM ? [13:50:54] (03CR) 10Andrew Bogott: [C: 032] In our libvirt hack, rename libvirt_images_type to images_type [puppet] - 10https://gerrit.wikimedia.org/r/281683 (https://phabricator.wikimedia.org/T131322) (owner: 10Andrew Bogott) [13:51:04] PROBLEM - nutcracker port on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:51:14] PROBLEM - Disk space on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:51:35] PROBLEM - HHVM processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:52:25] PROBLEM - DPKG on labmon1001 is CRITICAL: Timeout while attempting connection [13:52:44] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: Connection timed out [13:52:54] PROBLEM - salt-minion processes on labmon1001 is CRITICAL: Timeout while attempting connection [13:53:13] elukey, _joe_: seems we have another node leaking memory. I don't have actual metrics, but it seems like a few too many nodes having issues for today [13:53:15] PROBLEM - dhclient process on labmon1001 is CRITICAL: Timeout while attempting connection [13:53:24] PROBLEM - configured eth on labmon1001 is CRITICAL: Timeout while attempting connection [13:53:34] moritzm: ^ you? 
[13:53:44] PROBLEM - Disk space on labmon1001 is CRITICAL: Timeout while attempting connection [13:53:45] PROBLEM - RAID on labmon1001 is CRITICAL: Timeout while attempting connection [13:53:54] PROBLEM - graphite-web uWSGI web app on labmon1001 is CRITICAL: Timeout while attempting connection [13:54:04] PROBLEM - Graphite Carbon on labmon1001 is CRITICAL: Timeout while attempting connection [13:54:14] PROBLEM - puppet last run on labmon1001 is CRITICAL: Timeout while attempting connection [13:55:20] gehel: yeah :( [13:55:46] gehel: can you double check dmesg and reboot it? [13:56:35] elukey: can't SSH into it (already dead it seems) [13:56:45] * gehel needs to learn how to powercycle those servers... [13:57:22] elukey: any chance you could walk me through it for the first time? [13:57:28] there is a console somewhere in the ops toolbelt that doesn't go over the network, but i have no clue where :) [13:57:37] well, not over standard ssh at least [13:58:28] ebernhardson: yep, I'm searching for the doc on where it is hidden, but our search engine does not give me what I'm looking for :P [13:58:34] godog: yeah, I/O on labmon is dead-slow, so the puppet run to active ferm takes a little [13:58:47] gehel: https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Opengear_Serial_Consoles maybe? [13:58:53] mobrovac: any ideas why on i/o? [13:58:55] RECOVERY - configured eth on labmon1001 is OK: OK - interfaces up [13:58:57] wrong person ^ :) [13:59:06] RECOVERY - Disk space on labmon1001 is OK: DISK OK [13:59:07] moritzm: any ideas on why for i/o sluggishness there? [13:59:15] RECOVERY - RAID on labmon1001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [13:59:21] now complete [13:59:24] RECOVERY - graphite-web uWSGI web app on labmon1001 is OK: uwsgi-graphite-web start/running, process 41564 [13:59:34] RECOVERY - Graphite Carbon on labmon1001 is OK: OK: All defined Carbon jobs are runnning. 
[13:59:42] chasemp: needs further investigation, tracked at https://phabricator.wikimedia.org/T127957 [13:59:45] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [13:59:55] RECOVERY - DPKG on labmon1001 is OK: All packages OK [14:00:25] RECOVERY - salt-minion processes on labmon1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:00:45] RECOVERY - dhclient process on labmon1001 is OK: PROCS OK: 0 processes with command name dhclient [14:02:49] (03PS3) 10Elukey: Add complete submodule support to the puppet compiler. [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/282652 (https://phabricator.wikimedia.org/T132154) [14:03:38] gehel: sure! [14:03:50] pvt [14:04:22] I think the tl;dr is 'graphite' [14:04:32] <_joe_> ok let's see what I am going to break [14:04:44] (03PS3) 10Giuseppe Lavagetto: redis: add monitoring define [puppet] - 10https://gerrit.wikimedia.org/r/282383 [14:05:02] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1686 bytes in 1.199 second response time [14:08:16] (03CR) 10Giuseppe Lavagetto: [C: 032] redis: add monitoring define [puppet] - 10https://gerrit.wikimedia.org/r/282383 (owner: 10Giuseppe Lavagetto) [14:09:14] !log mw1140 powercycled (ssh not usable, can't access as root via console too) [14:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:10:45] 6Operations, 10Wikimedia-Apache-configuration, 13Patch-For-Review: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#2194883 (10faidon) I've updated the target URL to the requested one; it will take a while to be fully deployed (up to 30 minutes for the initial deploy, then after that i... 
[14:11:23] RECOVERY - salt-minion processes on mw1140 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:11:41] RECOVERY - RAID on mw1140 is OK: OK: no RAID installed [14:11:43] RECOVERY - SSH on mw1140 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [14:11:55] good boy [14:12:01] RECOVERY - configured eth on mw1140 is OK: OK - interfaces up [14:12:21] RECOVERY - dhclient process on mw1140 is OK: PROCS OK: 0 processes with command name dhclient [14:12:21] RECOVERY - Check size of conntrack table on mw1140 is OK: OK: nf_conntrack is 0 % full [14:12:23] RECOVERY - nutcracker process on mw1140 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [14:12:32] RECOVERY - nutcracker port on mw1140 is OK: TCP OK - 0.000 second response time on port 11212 [14:12:39] (03PS1) 10Giuseppe Lavagetto: redis::monitoring::instance: fix inclusion of class [puppet] - 10https://gerrit.wikimedia.org/r/282697 [14:12:42] RECOVERY - HHVM processes on mw1140 is OK: PROCS OK: 6 processes with command name hhvm [14:12:42] RECOVERY - DPKG on mw1140 is OK: All packages OK [14:12:51] RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 55 minutes ago with 0 failures [14:13:01] RECOVERY - Disk space on mw1140 is OK: DISK OK [14:13:12] RECOVERY - HHVM rendering on mw1140 is OK: HTTP OK: HTTP/1.1 200 OK - 74118 bytes in 0.427 second response time [14:13:22] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] redis::monitoring::instance: fix inclusion of class [puppet] - 10https://gerrit.wikimedia.org/r/282697 (owner: 10Giuseppe Lavagetto) [14:13:22] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.060 second response time [14:17:31] _joe_: a bit of precise holdover not sure how you would prefer to handle it, https://phabricator.wikimedia.org/rOPUP90e56fa1ad9d3033b248cba1357751f72ab83e38 seems to have broken puppet on all precise exec nodes in Tools 
[14:17:37] 6Operations, 10Wikimedia-Apache-configuration, 13Patch-For-Review: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#2194893 (10faidon) Also see T44085, which tracks progress for the implementation of a URL-shortener domain (we have already gotten `w.wiki` for this purpose). [14:18:52] <_joe_> chasemp: there is already a fix in mediawiki::packages [14:19:00] <_joe_> chasemp: what error are you seeing exactly? [14:19:47] _joe_: https://phabricator.wikimedia.org/T132282 [14:20:10] my first thought was a weirdness in os_version but yeah not sure [14:21:10] <_joe_> oh shit, the fonts too? [14:21:47] <_joe_> chasemp: why do we have precise runners on tools? [14:22:00] 6Operations, 7Graphite, 13Patch-For-Review: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141#2194903 (10fgiunchedi) [14:22:02] 6Operations, 10Monitoring, 13Patch-For-Review: switch diamond to use graphite line protocol - https://phabricator.wikimedia.org/T121861#2194901 (10fgiunchedi) 5Resolved>3Open reopening as there seem to be missing ACLs from hosts with public IPs towards graphite1001:2003 tcp (e.g. carbon) [14:22:44] the short answer is because we have never gotten rid of them, the long answer is something along the lines of it's much harder to deprecate in Tools/Labs than prod [14:22:57] precise will phase out as we get rid of SGE [14:23:06] though there has never been explicit arrangement that I know of [14:23:20] <_joe_> chasemp: so I'd say you create a separate class to use in toollabs [14:23:34] paravoid: would you have a minute for missing acls for diamond? https://phabricator.wikimedia.org/T121861#2194901 [14:24:22] _joe_: that seems like unending trouble doesn't it? [14:24:38] maybe we can fiat whatever needs this has to be on trusty [14:24:45] <_joe_> chasemp: supporting precise in mediawiki because of an unrelated use in toollabs? 
[14:24:53] <_joe_> I agree, it would be an unending trouble [14:24:55] <_joe_> :P [14:25:29] <_joe_> my point is - you're not using that class to run mediawiki [14:25:48] (03PS1) 10Andrew Bogott: Spreadcheck: Exclude non-active instances from check [puppet] - 10https://gerrit.wikimedia.org/r/282701 (https://phabricator.wikimedia.org/T119929) [14:25:54] <_joe_> and we can't really support both precise and jessie without transforming that class in a nested mess [14:26:26] <_joe_> so I'd rather keep a separated class with all the fonts needed for a toollabs worker [14:26:35] <_joe_> they might not be the same as the mediawiki list [14:26:56] <_joe_> it's really better than having unlogic dependencies between modules [14:26:59] godog: sec [14:27:20] _joe_: part of it may be people have run mediawiki in tools historically [14:27:27] though seldom and it's mostly a bad idea [14:27:59] (03PS1) 10Muehlenhoff: Fix handling of unspecified distro releases [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/282702 [14:28:19] I'll look into the why of it [14:30:54] (03CR) 10Muehlenhoff: [C: 032 V: 032] Fix handling of unspecified distro releases [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/282702 (owner: 10Muehlenhoff) [14:34:38] (03PS1) 10Giuseppe Lavagetto: redis::monitoring: fixup of passing $port along [puppet] - 10https://gerrit.wikimedia.org/r/282704 [14:35:17] godog: not an ACL issue [14:35:22] iptables probably [14:35:45] root@graphite1001:~# iptables -nvxL | grep 2003 [14:35:45] 655397 39323820 ACCEPT tcp -- * * 10.0.0.0/8 0.0.0.0/0 tcp dpt:2003 [14:35:48] 0 0 ACCEPT udp -- * * 10.0.0.0/8 0.0.0.0/0 udp dpt:2003 [14:35:51] yup [14:36:22] which is also very wrong from the opposite side as well [14:36:31] (this includes Labs) [14:37:28] also see https://gerrit.wikimedia.org/r/#/c/260926/ [14:38:19] paravoid: sigh, thanks, fixing [14:39:37] 6Operations, 6Labs, 13Patch-For-Review, 15User-bd808: Setting up bulk proxies pointing to a multiwiki mediawiki-vagrant setup 
running on a labs vm - https://phabricator.wikimedia.org/T132216#2194966 (10faidon) [14:40:17] (03PS2) 10Giuseppe Lavagetto: redis::monitoring: fixup and some more parameter checks [puppet] - 10https://gerrit.wikimedia.org/r/282704 [14:42:12] (03PS3) 10Giuseppe Lavagetto: redis::monitoring: fixup and some more parameter checks [puppet] - 10https://gerrit.wikimedia.org/r/282704 [14:43:11] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] redis::monitoring: fixup and some more parameter checks [puppet] - 10https://gerrit.wikimedia.org/r/282704 (owner: 10Giuseppe Lavagetto) [14:45:26] (03PS1) 10Giuseppe Lavagetto: redis::monitoring: removed double declaration of $port [puppet] - 10https://gerrit.wikimedia.org/r/282705 [14:45:47] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] redis::monitoring: removed double declaration of $port [puppet] - 10https://gerrit.wikimedia.org/r/282705 (owner: 10Giuseppe Lavagetto) [14:47:37] (03PS1) 10Filippo Giunchedi: graphite: permit line protocol traffic from ALL_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/282706 (https://phabricator.wikimedia.org/T121861) [14:47:46] Labs down :/ [14:48:51] (or at least some tools :/) [14:49:06] <_joe_> chasemp: ^^ [14:49:10] seems my connection a bastion is not working [14:49:18] hmm..seems ok now...nvm [14:50:29] (03PS1) 10Giuseppe Lavagetto: nagios_common::check::redis: create file, not a directory [puppet] - 10https://gerrit.wikimedia.org/r/282707 [14:51:24] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] nagios_common::check::redis: create file, not a directory [puppet] - 10https://gerrit.wikimedia.org/r/282707 (owner: 10Giuseppe Lavagetto) [14:51:58] (03CR) 10Muehlenhoff: [C: 031] graphite: permit line protocol traffic from ALL_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/282706 (https://phabricator.wikimedia.org/T121861) (owner: 10Filippo Giunchedi) [14:52:48] (03PS1) 10Giuseppe Lavagetto: redis::monitoring::instance: s/instance_name/port/ [puppet] - 10https://gerrit.wikimedia.org/r/282708 
[14:52:57] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/3: down - TiNet {#1065}BR [14:53:14] (03PS2) 10Filippo Giunchedi: graphite: permit line protocol traffic from ALL_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/282706 (https://phabricator.wikimedia.org/T121861) [14:53:16] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] redis::monitoring::instance: s/instance_name/port/ [puppet] - 10https://gerrit.wikimedia.org/r/282708 (owner: 10Giuseppe Lavagetto) [14:53:33] (03CR) 10Gehel: "I did not have any issue when restarting eqiad last time (and retrying until it was successful). But increasing timeout or providing an op" [puppet] - 10https://gerrit.wikimedia.org/r/282472 (https://phabricator.wikimedia.org/T128786) (owner: 10Adedommelin) [14:53:35] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: permit line protocol traffic from ALL_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/282706 (https://phabricator.wikimedia.org/T121861) (owner: 10Filippo Giunchedi) [14:53:39] 6Operations, 10Dumps-Generation, 7HHVM, 13Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#2195016 (10ArielGlenn) I've got a tiny program that fails, now trying to get the equivalent with gzip set up for comparison. [14:53:54] (03PS3) 10Filippo Giunchedi: graphite: permit line protocol traffic from ALL_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/282706 (https://phabricator.wikimedia.org/T121861) [14:54:04] (03CR) 10Filippo Giunchedi: [V: 032] graphite: permit line protocol traffic from ALL_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/282706 (https://phabricator.wikimedia.org/T121861) (owner: 10Filippo Giunchedi) [14:54:37] PROBLEM - Redis status on mc2001 is CRITICAL: No such file or directory at /usr/lib/nagios/plugins/check_redis line 2483. 
[14:55:51] <_joe_> that's my fault ^^ [14:55:57] PROBLEM - Redis status on mc1001 is CRITICAL: No such file or directory at /usr/lib/nagios/plugins/check_redis line 2483. [14:58:39] PROBLEM - Auth DNS for labs pdns on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [14:59:33] <_joe_> this is not me [14:59:53] (03PS3) 10Alexandros Kosiaris: Introduce meitnerium.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/282676 (https://bugzilla.wikimedia.org/131358) [15:00:00] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Introduce meitnerium.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/282676 (https://bugzilla.wikimedia.org/131358) (owner: 10Alexandros Kosiaris) [15:00:04] anomie ostriches thcipriani marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160411T1500). [15:00:04] Urbanecm James_F bearND matt_flaschen yurik: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [15:00:47] andrewbogott: about? labs dns issues [15:01:11] I can SWAT, who's around for SWAT? [15:01:20] Me [15:01:33] RECOVERY - Redis status on mc2001 is OK: OK: REDIS on 10.192.0.34:6380 has 1 databases (db0) with 145753 keys [15:01:35] * James_F waves. [15:01:44] chasemp: wait, a new one, or just the brief one that happened 10 minutes ago? [15:01:51] (03PS4) 10Elukey: Add complete submodule support to the puppet compiler. 
[software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/282652 (https://phabricator.wikimedia.org/T132154) [15:01:53] Hi [15:01:53] andrewbogott: now [15:01:56] (03PS8) 10Thcipriani: Add new namespaces and new aliases for newikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281443 (https://phabricator.wikimedia.org/T131754) (owner: 10Urbanecm) [15:02:09] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281443 (https://phabricator.wikimedia.org/T131754) (owner: 10Urbanecm) [15:02:10] Present [15:02:47] ok, restarting pdns everywhere... [15:03:03] (03Merged) 10jenkins-bot: Add new namespaces and new aliases for newikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281443 (https://phabricator.wikimedia.org/T131754) (owner: 10Urbanecm) [15:03:45] 6Operations, 10Dumps-Generation, 7HHVM, 13Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#2195068 (10ArielGlenn) This runs successfully with a 0 exit code: 6Operations, 10Gerrit, 10Mail, 7Upstream: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2195070 (10greg) @Jdforrester-WMF reports that he is experiencing the same symptoms today (lack of receiving email from Gerrit, that is :) ). [15:04:54] PROBLEM - Auth DNS for labs pdns on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [15:05:11] 6Operations, 10Gerrit, 10Mail, 7Upstream: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2195072 (10greg) [15:05:56] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Add new namespaces and new aliases for newikibooks [[gerrit:281443]] (duration: 00m 45s) [15:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:06:11] ^ Urbanecm check please [15:06:57] It seems that it works. 
[15:07:13] (03PS2) 10Thcipriani: Add flood group to ladwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282201 (https://phabricator.wikimedia.org/T131527) (owner: 10Urbanecm) [15:07:40] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282201 (https://phabricator.wikimedia.org/T131527) (owner: 10Urbanecm) [15:08:10] (03Merged) 10jenkins-bot: Add flood group to ladwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282201 (https://phabricator.wikimedia.org/T131527) (owner: 10Urbanecm) [15:08:30] andrewbogott: yeah dns still seems to be flapping [15:08:49] chasemp: it's not fixed now? [15:09:02] oh, so I see [15:09:04] 6Operations, 10Dumps-Generation, 7HHVM, 13Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#2195077 (10ArielGlenn) {F3862019} This is the sample xml file I used. So hhvm somehow doesn't clean up properly compress.bzip2 streams if the php process lea... [15:10:21] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Add flood group to ladwiki [[gerrit:282201]] (duration: 00m 29s) [15:10:23] ^ Urbanecm check please [15:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:10:54] (03PS2) 10Thcipriani: Enable VisualEditor on the Project ('Wikipedya') of htwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281263 (https://phabricator.wikimedia.org/T130177) (owner: 10Jforrester) [15:10:57] It works. [15:10:59] RECOVERY - Redis status on mc1001 is OK: OK: REDIS on 10.64.0.180:6379 has 1 databases (db0) with 141956 keys [15:11:02] Thanks for deploys. [15:11:09] Urbanecm: thanks for checking! [15:11:23] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281263 (https://phabricator.wikimedia.org/T130177) (owner: 10Jforrester) [15:11:30] andrewbogott: it's pretty erratic? 
[15:11:56] (03Merged) 10jenkins-bot: Enable VisualEditor on the Project ('Wikipedya') of htwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281263 (https://phabricator.wikimedia.org/T130177) (owner: 10Jforrester) [15:12:10] (03PS5) 10Elukey: Add complete submodule support to the puppet compiler. [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/282652 (https://phabricator.wikimedia.org/T132154) [15:13:08] RECOVERY - Auth DNS for labs pdns on labs-ns0.wikimedia.org is OK: DNS OK: 0.055 seconds response time. nagiostest.eqiad.wmflabs returns [15:13:39] RECOVERY - Auth DNS for labs pdns on labs-ns1.wikimedia.org is OK: DNS OK: 0.065 seconds response time. nagiostest.eqiad.wmflabs returns [15:13:47] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable VisualEditor on the Project ("Wikipedya") of htwiki [[gerrit:281263]] (duration: 00m 32s) [15:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:13:57] ^ James_F check please [15:14:12] Hmm. [15:14:23] andrewbogott: pdns seems to know it's not answering and is timing out queries [15:14:26] is this a backend issue? [15:15:02] 6Operations, 7Graphite, 13Patch-For-Review: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141#2195087 (10fgiunchedi) [15:15:04] 6Operations, 10Monitoring, 13Patch-For-Review: switch diamond to use graphite line protocol - https://phabricator.wikimedia.org/T121861#2195085 (10fgiunchedi) 5Open>3Resolved as it turns out it wasn't ACLs but iptables, now fixed [15:15:40] thcipriani: For some reason it's not working. 
Everything else looks OK… [15:15:49] chasemp: I'm using 'watch' to monitor the responses from each recursor and auth server [15:15:57] everything looks stable at the moment, but keeping an eye on it [15:16:15] andrewbogott: it stopped handing out servfail about a minute ago or at least I haven't seen one for a minute [15:16:16] chasemp: are you still seeing bad behavior? [15:16:22] thcipriani: Aha, working now. [15:16:26] thcipriani: Maybe a cache. [15:16:27] but it's been totally hot cold for maybe 10-15+ [15:16:28] thcipriani: Proceed. [15:16:34] James_F: cool, thanks! [15:16:49] * thcipriani was doublechecking that I actually pulled down the change :P [15:16:59] * James_F grins. [15:17:05] (03PS2) 10Thcipriani: [Cleanup] Remove VisualEditor experimental config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280869 (owner: 10Jforrester) [15:17:08] thcipriani: I blame HHVM. It's an easy target. [15:17:21] 6Operations, 10ops-eqiad, 6Analytics-Kanban: Analytics1039 host showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2195105 (10elukey) @Southparkfan: thanks for the info! This is the first host that explicitly shows thermal errors, meanwhile the other one just rebooted for some reason... 
[15:17:29] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280869 (owner: 10Jforrester) [15:17:54] (03Merged) 10jenkins-bot: [Cleanup] Remove VisualEditor experimental config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280869 (owner: 10Jforrester) [15:18:18] chasemp: I'll keep an eye on it, but for the moment I'm still thinking that we should do https://phabricator.wikimedia.org/T128737 and related tasks before thinking too hard about this [15:19:16] andrewbogott: this makes me pretty nervous, I do still see failures but related to a few hosts and I don't know if that pre-existed [15:19:49] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [15:20:01] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [Cleanup] Remove VisualEditor experimental config 1 of 2 [[gerrit:280869]] (duration: 00m 28s) [15:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:20:42] 6Operations, 10Gerrit, 10Mail, 7Upstream: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2195116 (10demon) I've been getting gerrit e-mails this morning and I see nothing stuck in the queue. [15:20:54] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [Cleanup] Remove VisualEditor experimental config 2 of 2 [[gerrit:280869]] (duration: 00m 28s) [15:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:21:22] ^ James_F sync'd without explosion [15:21:28] 6Operations, 10Dumps-Generation, 7HHVM, 13Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#2195117 (10ArielGlenn) Woo. [15:21:44] Err. 
[15:21:51] (03PS2) 10Thcipriani: [Cleanup] Remove VisualEditor AutoAccountEnable config now unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280870 (owner: 10Jforrester) [15:21:54] Oh, yes, there's one more. [15:22:14] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280870 (owner: 10Jforrester) [15:22:22] indeed there is. [15:22:50] (03Merged) 10jenkins-bot: [Cleanup] Remove VisualEditor AutoAccountEnable config now unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280870 (owner: 10Jforrester) [15:23:18] PROBLEM - HHVM rendering on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:23:49] PROBLEM - Apache HTTP on mw1190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:24:37] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [Cleanup] Remove VisualEditor AutoAccountEnable config now unused [[gerrit:280870]] (duration: 00m 25s) [15:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:24:54] ^ James_F also sync'd sans explosion [15:25:06] Good-o! [15:25:09] Thank you. :-) [15:25:09] 7Puppet, 10Beta-Cluster-Infrastructure: deployment-redis0[12], deployment-sentry2 puppet failures due to missing hiera data redis::shards - https://phabricator.wikimedia.org/T132263#2195136 (10greg) p:5Low>3Normal (not low for Beta Cluster, where this task has it's "home") :) [15:25:11] 7Puppet, 10Beta-Cluster-Infrastructure: deployment-memc0[2-4] puppet failure due to missing hiera data mediawiki::redis_servers::codfw - https://phabricator.wikimedia.org/T132262#2195139 (10greg) p:5Lowest>3Normal (not lowest for Beta Cluster, where this task has it's "home") :) [15:26:34] 6Operations, 10Gerrit, 10Mail, 7Upstream: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2195143 (10akosiaris) The queue however is being processed normally from what I see. 
``` ssh -p 29418 akosiaris@gerrit.wikimedia.org gerrit show-queue |wc -l 8 ``` so it must be s... [15:26:49] RECOVERY - HHVM rendering on mw1190 is OK: HTTP OK: HTTP/1.1 200 OK - 74121 bytes in 0.287 second response time [15:27:04] 7Puppet, 10Beta-Cluster-Infrastructure: deployment-redis0[12], deployment-sentry2 puppet failures due to missing hiera data redis::shards - https://phabricator.wikimedia.org/T132263#2195144 (10Krenair) @Joe: Didn't your commit linked above cause this error? [15:27:11] !log restarted hhvm on mw1190 (hhvm-dump-debug saved in /tmp/hhvm.19289.bt) [15:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:27:29] RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.044 second response time [15:27:31] (03PS2) 10Thcipriani: Enable Echo survey on French-language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282414 (https://phabricator.wikimedia.org/T131893) (owner: 10Mattflaschen) [15:27:48] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282414 (https://phabricator.wikimedia.org/T131893) (owner: 10Mattflaschen) [15:27:51] blame to HPHP::Treadmill::getAgeOldestRequest [15:28:24] (03Merged) 10jenkins-bot: Enable Echo survey on French-language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282414 (https://phabricator.wikimedia.org/T131893) (owner: 10Mattflaschen) [15:28:29] 6Operations, 10Gerrit, 10Mail, 7Upstream: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2195146 (10greg) Sorry for the false-alarm, it was all in his Spam folder :) (which is weird, but, definitely not because of this task...) 
[15:28:32] 7Puppet, 10Beta-Cluster-Infrastructure: deployment-memc0[2-4] puppet failure due to missing hiera data mediawiki::redis_servers::codfw - https://phabricator.wikimedia.org/T132262#2195147 (10Krenair) @Joe: Again, didn't one of your commits linked above cause this error? [15:29:48] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable Echo survey on French-language wikis [[gerrit:282414]] (duration: 00m 25s) [15:29:52] ^ matt_flaschen check please [15:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:30:12] 6Operations, 6Performance-Team, 7Availability, 7Epic, and 3 others: Cleanup active-DC based MW config code and make it more robust and easy to change - https://phabricator.wikimedia.org/T114273#2195148 (10Joe) 5Open>3Resolved [15:31:53] (03Abandoned) 10ArielGlenn: Override hhvm.server.light_process_count on snapshots [puppet] - 10https://gerrit.wikimedia.org/r/282314 (owner: 10ArielGlenn) [15:33:07] thcipriani, works. [15:33:15] Tested on French Wikipedia. [15:33:28] matt_flaschen: great, thanks for checking! [15:33:58] thcipriani: you swatting again? :D [15:34:14] mafk: yes indeed :) [15:34:23] * mafk hugs [15:35:05] Except it's cut-off. [15:35:15] But that's not related to the deployment per se. [15:35:20] I'll follow up. 
[15:36:58] PROBLEM - MySQL Processlist on db1040 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 0 copy to table, 76 statistics [15:37:26] unfortunately expected, related to the jobqueue [15:38:49] RECOVERY - MySQL Processlist on db1040 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 2 statistics [15:40:41] !log thcipriani@tin Synchronized php-1.27.0-wmf.20/extensions/MobileApp/config/config.json: SWAT: Roll out RESTBase usage to Android production app: 100% [[gerrit:282434]] (duration: 00m 27s) [15:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:40:47] ^ bearND check please [15:42:21] thcipriani: hmm, https://meta.wikimedia.org/static/current/extensions/MobileApp/config/android.json?hdfjkhdkjfh still shows the old value of 50. I expected 100 for the last entry [15:43:09] Morning bblack...! How's it going? Quick question: do we log anywhere on production the names of cookies we're getting in the wild? Even a sample thereof? [15:43:49] thcipriani: has the sync finished? [15:44:03] (03PS1) 10Madhuvishy: analytics_cluster: Add wrapper script for beeline [puppet] - 10https://gerrit.wikimedia.org/r/282713 (https://phabricator.wikimedia.org/T116123) [15:44:28] bearND: yes indeed. [15:45:46] (03CR) 10Jforrester: [C: 04-1] "To be announced first."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/280002 (owner: 10Jforrester) [15:47:49] (03PS2) 10Andrew Bogott: Spreadcheck: Exclude non-active instances from check [puppet] - 10https://gerrit.wikimedia.org/r/282701 (https://phabricator.wikimedia.org/T119929) [15:48:01] 6Operations, 10Analytics, 10Traffic: cronspam from cpXXXX hosts related to varnishkafka non existent processes - https://phabricator.wikimedia.org/T132346#2195218 (10elukey) [15:48:05] (03CR) 10Andrew Bogott: [C: 032] "Yuvi approves, via IRC" [puppet] - 10https://gerrit.wikimedia.org/r/282701 (https://phabricator.wikimedia.org/T119929) (owner: 10Andrew Bogott) [15:48:45] thcipriani: is the file updated on the file system of a prod machine? [15:49:08] extensions/MobileApp/config/android.json [15:49:49] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [15:49:51] bearND: yup, just spot-checked a few, seems to be up-to-date. android is a symlink as is ios.json [15:50:38] thcipriani: yes, that's sounds correct [15:50:46] also did mwscript purgeList.php on that specific url to no avail. Seems like it may be stuck somewhere in our layers of caching :\ [15:51:25] thcipriani: yes, i did that, too. Maybe we'll give it some time. I'm currently out of ideas. [15:51:49] bearND: kk [15:51:49] thcipriani, can you roll back the Echo config patch? I decided the rendering problem is user-impacting enough I'd prefer to re-do it this evening (I know what else has to be SWATed to fix it). [15:51:51] (03PS1) 10BBlack: misc large objects refactor [puppet] - 10https://gerrit.wikimedia.org/r/282716 (https://phabricator.wikimedia.org/T128813) [15:51:57] matt_flaschen: yup. 
[15:52:00] Thanks [15:52:13] 6Operations: Tracking and Reducing cron-spam from root@ - https://phabricator.wikimedia.org/T132324#2195243 (10elukey) [15:52:15] 6Operations, 10Analytics, 10Traffic: cronspam from cpXXXX hosts related to varnishkafka non existent processes - https://phabricator.wikimedia.org/T132346#2195244 (10elukey) [15:53:19] (03CR) 10Jforrester: "Due to go out on 25 April." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282381 (https://phabricator.wikimedia.org/T131605) (owner: 10Greg Grossmeier) [15:54:14] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Revert "Enable Echo survey on French-language wikis" (duration: 00m 26s) [15:54:17] ^ matt_flaschen done [15:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:55:14] 7Puppet, 10Beta-Cluster-Infrastructure, 7Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2195281 (10Joe) [15:55:16] 7Puppet, 10Beta-Cluster-Infrastructure: deployment-memc0[2-4] puppet failure due to missing hiera data mediawiki::redis_servers::codfw - https://phabricator.wikimedia.org/T132262#2195280 (10Joe) 5Open>3Resolved [15:55:26] thcipriani, thanks, I'll confirm when Varnish cache expires. [15:55:26] (03PS1) 10Thcipriani: Revert "Enable Echo survey on French-language wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282718 [15:55:35] 6Operations, 6Labs, 10Tool-Labs, 7Icinga, 13Patch-For-Review: tool labs instance distribution monitoring is broken - https://phabricator.wikimedia.org/T119929#2195282 (10Andrew) 5Open>3Resolved Test is fixed and passing. 
[15:56:01] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282718 (owner: 10Thcipriani) [15:56:32] (03Merged) 10jenkins-bot: Revert "Enable Echo survey on French-language wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282718 (owner: 10Thcipriani) [15:57:10] 7Puppet, 10Beta-Cluster-Infrastructure: deployment-memc0[2-4] puppet failure due to missing hiera data mediawiki::redis_servers::codfw - https://phabricator.wikimedia.org/T132262#2192909 (10Joe) @Krenair yes, that doesn't mean I should be assigned a ticket. It's just not the way it works. [15:57:30] (03PS4) 10Thcipriani: Match JsonConfig change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282447 (owner: 10Yurik) [15:57:33] 7Puppet, 10Beta-Cluster-Infrastructure, 7Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2195289 (10Joe) [15:57:35] 7Puppet, 10Beta-Cluster-Infrastructure: deployment-redis0[12], deployment-sentry2 puppet failures due to missing hiera data redis::shards - https://phabricator.wikimedia.org/T132263#2195288 (10Joe) 5Open>3Resolved [15:57:41] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282447 (owner: 10Yurik) [15:58:15] (03Merged) 10jenkins-bot: Match JsonConfig change [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282447 (owner: 10Yurik) [15:58:33] 6Operations, 10Ops-Access-Requests: global root access for gilles - https://phabricator.wikimedia.org/T130910#2195290 (10Dzahn) >>! In T130910#2178753, @BBlack wrote: >>>! In T130910#2178741, @ori wrote: >> First of all, the request was not for global root. The task description makes it clear that the access r... 
[15:59:54] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: Match JsonConfig change [[gerrit:282447]] (duration: 00m 25s) [15:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:00:26] James_F: Does the "noratelimit" right overrides https://gerrit.wikimedia.org/r/#/c/280002/2/wmf-config/InitialiseSettings.php ? [16:00:35] *override [16:00:44] 6Operations, 10DBA: Job Queue growing and then running a lot of jobs at once on commonswiki - https://phabricator.wikimedia.org/T132318#2195296 (10Volans) [16:00:44] Luke|Busy: Yes. [16:00:51] ok, thx [16:01:11] Luke|Busy: So bots/sysops/stewards won't be affected [16:01:47] James_F: Ok, I just asked, because dewiki has a "limitexception" group, and I wanted to know if this group overrides the limit too, ok. Thanks [16:02:07] James_F: When do you guess is the deployment window for this? [16:02:09] * James_F nods. [16:02:22] Luke|Busy: Not this week or next. It'll be in Tech/News first. [16:02:27] (03PS5) 10Alexandros Kosiaris: contint: set pbuilder basepath to actual directory [puppet] - 10https://gerrit.wikimedia.org/r/269103 (https://phabricator.wikimedia.org/T125999) (owner: 10Hashar) [16:02:29] ok [16:02:35] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] contint: set pbuilder basepath to actual directory [puppet] - 10https://gerrit.wikimedia.org/r/269103 (https://phabricator.wikimedia.org/T125999) (owner: 10Hashar) [16:03:10] PROBLEM - aqs endpoints health on aqs1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:03:18] PROBLEM - aqs endpoints health on aqs1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:03:23] (03PS2) 10Alexandros Kosiaris: Adds russian myspell package to ores base. [puppet] - 10https://gerrit.wikimedia.org/r/282408 (owner: 10Halfak) [16:03:29] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Adds russian myspell package to ores base. 
[puppet] - 10https://gerrit.wikimedia.org/r/282408 (owner: 10Halfak) [16:04:36] thcipriani, revert tested. Thanks. [16:04:50] RECOVERY - aqs endpoints health on aqs1002 is OK: All endpoints are healthy [16:04:52] matt_flaschen: cool, thanks for checking. [16:04:59] RECOVERY - aqs endpoints health on aqs1003 is OK: All endpoints are healthy [16:05:59] 7Puppet, 10Beta-Cluster-Infrastructure: deployment-redis0[12], deployment-sentry2 puppet failures due to missing hiera data redis::shards - https://phabricator.wikimedia.org/T132263#2195340 (10Krenair) For the record it looks like this was fixed in https://wikitech.wikimedia.org/w/index.php?title=Hiera:Deploym... [16:06:08] (03Restored) 10Jforrester: [WIP] Make VisualEditor access RESTbase directly on private wikis too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200107 (owner: 10Jforrester) [16:06:16] 7Puppet, 10Beta-Cluster-Infrastructure: deployment-memc0[2-4] puppet failure due to missing hiera data mediawiki::redis_servers::codfw - https://phabricator.wikimedia.org/T132262#2195342 (10Krenair) For the record it looks like this was fixed in https://wikitech.wikimedia.org/w/index.php?title=Hiera:Deployment... [16:06:31] (03PS3) 10Yuvipanda: Adding gehel to some shinken notifications for labs [puppet] - 10https://gerrit.wikimedia.org/r/270729 (owner: 10Gehel) [16:06:38] (03CR) 10Yuvipanda: [C: 032 V: 032] Adding gehel to some shinken notifications for labs [puppet] - 10https://gerrit.wikimedia.org/r/270729 (owner: 10Gehel) [16:06:49] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [16:07:54] where the data from MediaWiki:Autoblock whitelist comes from? 
[16:08:00] (03PS2) 10Yuvipanda: toollabs: install goaccess on webproxy hosts [puppet] - 10https://gerrit.wikimedia.org/r/260566 (https://phabricator.wikimedia.org/T121233) (owner: 10Merlijn van Deen) [16:08:16] mafk: Either local or from translatewiki [16:08:17] (03CR) 10Yuvipanda: [C: 032 V: 032] toollabs: install goaccess on webproxy hosts [puppet] - 10https://gerrit.wikimedia.org/r/260566 (https://phabricator.wikimedia.org/T121233) (owner: 10Merlijn van Deen) [16:08:33] English default is probably defined in core, maybe overwritten in WikimediaMessages (but unlikely) [16:08:51] hoo: but the page already contains some IPs and ranges, I though it was set in core [16:09:13] at meta or at which wiki? [16:09:29] (03PS3) 10Yuvipanda: Remove special case handling for labs realm [puppet] - 10https://gerrit.wikimedia.org/r/264264 (owner: 10Muehlenhoff) [16:09:36] (03CR) 10Yuvipanda: [C: 032 V: 032] Remove special case handling for labs realm [puppet] - 10https://gerrit.wikimedia.org/r/264264 (owner: 10Muehlenhoff) [16:10:06] (03Restored) 10Jforrester: Increase default thumbnail display size from 220px to 300px [mediawiki-config] - 10https://gerrit.wikimedia.org/r/154408 (https://bugzilla.wikimedia.org/67709) (owner: 10Jforrester) [16:10:28] Luke|Busy: the default content [16:10:51] (03CR) 10Jforrester: [C: 031] "Let's get this scheduled." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281976 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [16:11:00] mafk: In that case WikimediaMessages, probably [16:11:13] hoo: Is translatewiki using wikimedia messages? 
[16:11:24] Luke|Busy: No, but WikimediaMessages is using twn [16:11:35] then it's default: [16:11:36] WikimediaMessages provides message overrides for Wikimedia-specific things [16:11:38] https://translatewiki.net/wiki/MediaWiki:Autoblock_whitelist [16:12:20] so it's core [16:12:22] https://translatewiki.net/w/i.php?title=MediaWiki:Autoblock_whitelist&action=history <-- weird [16:12:29] yep, seems so [16:12:29] Luke|Busy: That doesn't mean it's core [16:12:36] (but it still is core, just checked) [16:12:45] (03PS2) 10Yuvipanda: Flake8 for toollabs [puppet] - 10https://gerrit.wikimedia.org/r/279895 (owner: 10Ladsgroup) [16:12:49] WikimediaMessages are also translated via twn [16:12:52] (03CR) 10Yuvipanda: [C: 032 V: 032] Flake8 for toollabs [puppet] - 10https://gerrit.wikimedia.org/r/279895 (owner: 10Ladsgroup) [16:13:07] which part of core hoo? [16:13:25] hm... MediaWiki/core [16:13:34] mafk: Do you want to add IPs or what do you want to do? [16:13:48] Luke|Busy: exploring [16:13:55] ok :) [16:15:57] thcipriani: Why https://gerrit.wikimedia.org/r/#/c/282718/? What was wrong? [16:16:32] Luke|Busy: matt_flaschen said that some text was getting cut off, asked for a revert. [16:16:40] ok, thx [16:16:48] 6Operations, 10DBA, 6Performance-Team: Job Queue growing and then running a lot of jobs at once on commonswiki - https://phabricator.wikimedia.org/T132318#2195383 (10Volans) a:5Volans>3None [16:17:21] 6Operations, 10ops-eqiad, 10Traffic: investigate radon crash - https://phabricator.wikimedia.org/T131053#2195388 (10BBlack) 5Open>3Resolved a:3BBlack This has been stable since the thermal paste fix, resolving!
[16:19:16] 6Operations, 10Ops-Access-Requests: root access on swift machines for gilles - https://phabricator.wikimedia.org/T130910#2195403 (10Dzahn) [16:19:46] 6Operations, 10Ops-Access-Requests: root access on swift machines for gilles - https://phabricator.wikimedia.org/T130910#2150515 (10Dzahn) @ori renamed again, please let me know if this seems accurate or feel free to rename it as you see fit [16:25:20] 6Operations, 10media-storage: Unable to delete, restore/undelete, move or upload new versions of files on several wikis ("inconsistent state within the internal storage backends") - https://phabricator.wikimedia.org/T128096#2195462 (10aaron) >>! In T128096#2157724, @fgiunchedi wrote: > leaving this open until... [16:25:39] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:34:53] bblack: yt? :) [16:35:40] AndyRussG: we are in our big meeting atm :) [16:35:51] chasemp: ah sorry! [16:37:18] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [16:40:08] PROBLEM - aqs endpoints health on aqs1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:42:45] AndyRussG: ? [16:43:25] bblack: hi! sorry for the bother... Was wondering if you know of any logs, even sampled, of the names of cookies we see in the wild? [16:43:40] RECOVERY - aqs endpoints health on aqs1001 is OK: All endpoints are healthy [16:44:48] I'd like to vacuum up old cookies that have been created by on-wiki CentralNotice banner JS... There's other ways to get this info, but if there's anything from prod, that'd be an additional source to check with [16:45:07] Thx! [16:50:14] AndyRussG: yeah I can at least find some of the low-hanging fruit (top pointless cookies still being sent) [16:50:35] bblack: fantastic, thx much!
[16:51:16] 6Operations, 10DBA, 6Performance-Team: Job Queue growing and then running a lot of jobs at once on commonswiki - https://phabricator.wikimedia.org/T132318#2195600 (10EBernhardson) Poked around but not finding any red herrings. I think it's a symptom rather than a cause, but slave-lag-limit looks to be retur... [16:52:09] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:56:31] 6Operations, 10MediaWiki-Parser, 10Traffic, 10Wikidata, and 4 others: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2195613 (10bmansurov) [16:57:48] (03CR) 10Luke081515: [C: 04-1] "I see some big problems in this change, because all wikis are affected:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280002 (owner: 10Jforrester) [16:58:30] James_F: I got three problems which will occur if your patch is live ^ [17:00:04] gehel: Respected human, time to deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160411T1700). Please do the needful. [17:07:45] 6Operations, 6Labs, 10Labs-Infrastructure: labnet1002 can't talk to webproxy.eqiad.wmnet:8080, puppet fails to install designateclient - https://phabricator.wikimedia.org/T129623#2195699 (10Dzahn) on labnet1002, running puppet seems fine now: Notice: Finished catalog run in 18.62 seconds [labnet1002:~] $ [17:08:07] 6Operations, 10DBA, 6Performance-Team: Job Queue growing and then running a lot of jobs at once on commonswiki - https://phabricator.wikimedia.org/T132318#2195701 (10EBernhardson) fwiw, when we have a spike in the DB the number of refreshLinksDynamic jobs run on commonswiki per minute spikes from ~5k to 50k:...
[17:09:17] 6Operations, 6Labs, 10Labs-Infrastructure: labnet1002 can't talk to webproxy.eqiad.wmnet:8080, puppet fails to install designateclient - https://phabricator.wikimedia.org/T129623#2195709 (10Dzahn) 5Open>3Resolved a:3chasemp @chasemp resolved, right? [labnet1002:~] $ nc webproxy.eqiad.wmnet 8080 GET in... [17:15:07] 6Operations, 10DBA, 6Performance-Team: Job Queue growing and then running a lot of jobs at once on commonswiki - https://phabricator.wikimedia.org/T132318#2195763 (10Volans) fwiw I don't see spikes on any other shard, only on s4 (commonswiki), from the databases point of view. [17:26:02] (03PS2) 10Volans: MariaDB: use Puppet certs for s5 cross-dc replica [puppet] - 10https://gerrit.wikimedia.org/r/282684 (https://phabricator.wikimedia.org/T111654) [17:27:33] 6Operations: Linking a bn.wikipedia.org button to G+ page. - https://phabricator.wikimedia.org/T109810#2195842 (10Dzahn) a:3Jalexander [17:28:30] (03PS1) 10ArielGlenn: add snapshots 1005-1007 to dsh group for mediawiki installs [puppet] - 10https://gerrit.wikimedia.org/r/282730 [17:28:54] !log deploying latest WDQS version [17:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:29:11] 6Operations: Ferm rules for netmon1001 - https://phabricator.wikimedia.org/T105410#1443320 (10Dzahn) [17:30:34] (03CR) 10ArielGlenn: [C: 032] add snapshots 1005-1007 to dsh group for mediawiki installs [puppet] - 10https://gerrit.wikimedia.org/r/282730 (owner: 10ArielGlenn) [17:32:53] 6Operations, 10Traffic, 10Wiki-Loves-Monuments-General, 7HTTPS: configure https for www.wikilovesmonuments.org - https://phabricator.wikimedia.org/T118388#2195850 (10Dzahn) @Akoopal great, thank you [17:33:44] (03PS1) 10Volans: Depool db1049 to deploy Pupper certs for TLS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282732 (https://phabricator.wikimedia.org/T111654) [17:33:45] what does "moved to backlog" actually do? 
i get a notification that tells me "it hasn't been done yet", kind of a non-update [17:34:11] I guess it depends on the team and where it was moved from? [17:34:15] at best it tells me "not going to happen soon" [17:34:31] true, Nikerabbit [17:35:08] unless it happens to be an active sprint, I would not expect good news ;) [17:36:00] hehe, yea [17:36:30] !log rebooting wdqs1001 [17:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:37:47] mutante: a notification? notifications for workboard moves are totally useless [17:38:06] gehel: are you still doing deployments? [17:38:12] 6Operations: Ferm rules for netmon1001 - https://phabricator.wikimedia.org/T105410#2195855 (10Dzahn) Thanks @Muehlenhoff ! [17:38:30] PROBLEM - Host wdqs1001 is DOWN: PING CRITICAL - Packet loss = 100% [17:38:31] volans: yep, running late, Still rebooting wdqs [17:38:44] YuviPanda: Thanks for merging my patch :) [17:38:54] Amir1: np! thanks for making it :) [17:38:54] can someone ack the alert about wdqs1001? [17:38:58] gehel: ^ [17:39:06] that's me ... [17:39:30] RECOVERY - Host wdqs1001 is UP: PING WARNING - Packet loss = 64%, RTA = 0.37 ms [17:39:31] gehel: ok, can you ping me when you finish please? need to depool a db and merge on puppet :) [17:39:44] Nemo_bis: yes, i use notifications for everything in phab, no email. and i agree, the ones that just tell me "moved to done" and stuff seem so redundant (they are already status resolved) [17:39:54] i'll adjust config [17:40:00] volans: I don't think there should be any conflict with WDQS... 
[17:41:03] gehel: ok, thanks, proceeding then [17:43:29] mutante: ah yes, the meaningless columns are especially annoying [17:43:43] Nemo_bis: ack [17:44:39] !log rebooting wdqs1002 for kernel upgrade [17:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:49:06] !log catrope@tin Synchronized php-1.27.0-wmf.20/includes/RevisionList.php: Attempt to fix oversight timeouts (duration: 00m 28s) [17:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:53:56] 6Operations, 13Patch-For-Review, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2195907 (10Dzahn) [17:53:58] 6Operations, 10DBA: upgrade db servers to jessie - https://phabricator.wikimedia.org/T125028#2195905 (10Dzahn) 5Invalid>3Open yes please, what was the outcome. should we link a decom ticket instead? [17:56:43] 6Operations, 6Labs, 10Labs-Infrastructure: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#2195914 (10Dzahn) [17:57:19] 6Operations, 10DBA, 6Labs, 10Labs-Infrastructure: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#2195917 (10Dzahn) [17:58:16] 6Operations, 10ops-eqiad, 6DC-Ops, 13Patch-For-Review: decom caesium - https://phabricator.wikimedia.org/T125165#2195921 (10Dzahn) @Cmjohnson what's missing to close this one? [17:58:31] 6Operations, 10ops-eqiad, 6DC-Ops: decom caesium - https://phabricator.wikimedia.org/T125165#2195922 (10Dzahn) [17:59:33] (03CR) 10Jforrester: "See the follow-up patch for Wikidata." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/280002 (owner: 10Jforrester) [18:00:26] Warning: require_once(/etc/mediawiki/WikitechPrivateSettings.php): failed to open stream: No such file or directory in /srv/mediawiki/wmf-config/wikitech.php on line 171 [18:00:29] * AaronSchulz sighs [18:01:30] andrewbogott, ^ [18:02:14] AaronSchulz: if you open a bug for that I can look when I get back from lunch. But most of the time those are caused by the private and public repos being out of sync. [18:02:38] breaks mwscript [18:02:39] oh, wait, I misread [18:03:10] or a script not running on silver/labtestweb2001 trying to use labswiki config? [18:04:30] so mwscriptwikiset (or foreachwiki) is hard to use then [18:05:03] I wonder if we should have a file with dummy values for other hosts [18:05:13] or only include it if it exists [18:06:01] (03CR) 10Luke081515: "We have users at normal wikis too, who got methods to edit a lot of pages in a very short time, via scripts etc. This patch would block th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280002 (owner: 10Jforrester) [18:08:19] !log catrope@tin Synchronized php-1.27.0-wmf.20/includes/RevisionList.php: Undo oversight live hack (duration: 00m 25s) [18:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:08:48] Krenair: that include has been around for a long time, though… surely mwscript hasn't been broken all that time? [18:09:09] andrewbogott, no but foreachwiki might've been [18:09:26] ah, that's used infrequently? 
[18:09:44] and I think it only shows the error, doesn't completely break on such warnings [18:09:49] (03PS1) 10Aaron Schulz: Make "refreshLinksDynamic" low-priority [puppet] - 10https://gerrit.wikimedia.org/r/282736 [18:11:36] AaronSchulz: in the long run, adding a commit message to that patch that explains the context and rationale is going to save more time than it will take [18:13:34] (03PS2) 10Aaron Schulz: Make "refreshLinksDynamic" low-priority [puppet] - 10https://gerrit.wikimedia.org/r/282736 [18:13:58] (03PS3) 10Ori.livneh: Make "refreshLinksDynamic" low-priority [puppet] - 10https://gerrit.wikimedia.org/r/282736 (https://phabricator.wikimedia.org/T132318) (owner: 10Aaron Schulz) [18:14:26] (03CR) 10Ori.livneh: [C: 032 V: 032] Make "refreshLinksDynamic" low-priority [puppet] - 10https://gerrit.wikimedia.org/r/282736 (https://phabricator.wikimedia.org/T132318) (owner: 10Aaron Schulz) [18:16:01] 07Puppet, 10Beta-Cluster-Infrastructure: Setup puppet exported resources to collect ssh host keys for beta - https://phabricator.wikimedia.org/T72792#2196009 (10Krenair) We chatted about this on IRC and my impression is that the issue is mostly (?) that production puppetmasters (which have exported resources s... [18:18:46] 07Puppet, 10Beta-Cluster-Infrastructure: Setup puppet exported resources to collect ssh host keys for beta - https://phabricator.wikimedia.org/T72792#2196043 (10yuvipanda) That's the underlying cause, yes - if they used the same setup, they'd have the same config which would include support for exported resour... 
[18:19:20] 06Operations, 06Labs, 13Patch-For-Review: Kill the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#2196048 (10Krenair) [18:19:22] 07Puppet, 10Beta-Cluster-Infrastructure: Setup puppet exported resources to collect ssh host keys for beta - https://phabricator.wikimedia.org/T72792#2196047 (10Krenair) [18:23:17] 06Operations: Linking a bn.wikipedia.org button to G+ page. - https://phabricator.wikimedia.org/T109810#1559747 (10Dzahn) @Jalexander sorry, i just assigned directly because i wasn't sure what project to add and made T132373 to request a new project for it. makes sense? [18:24:28] 06Operations, 10Analytics, 10Traffic: Generate a list of junk CN cookies being sent by clients - https://phabricator.wikimedia.org/T132374#2196113 (10BBlack) [18:25:38] 06Operations, 10Mail: Remove yana@ from legal-tm-vio@wikimedia.org - https://phabricator.wikimedia.org/T132230#2196130 (10JGulingan) Thank you for correcting me. I got the same request from Rachel Stallman to have Yana removed from this other legal alias: legal-reports@wikimedia.org They are unsure what othe... [18:27:38] 06Operations: Migrate titanium to jessie - https://phabricator.wikimedia.org/T123725#1936502 (10Dzahn) This hosts archiva.wikimedia.org https://wikitech.wikimedia.org/wiki/Archiva This indicates the Analytics team should be involved. 
[18:28:06] 06Operations, 10Analytics-Cluster: Migrate titanium to jessie (archiva.wikimedia.org upgrade) - https://phabricator.wikimedia.org/T123725#2196172 (10Dzahn) [18:28:23] (03PS2) 10BBlack: misc large objects refactor [puppet] - 10https://gerrit.wikimedia.org/r/282716 (https://phabricator.wikimedia.org/T128813) [18:29:04] 06Operations, 10Analytics-Cluster: Migrate titanium to jessie (archiva.wikimedia.org upgrade) - https://phabricator.wikimedia.org/T123725#1936502 (10Dzahn) https://wikitech.wikimedia.org/wiki/Analytics/Archiva @Analytics Does an upgrade of this server to jessie have blockers that are already known? [18:29:36] 06Operations, 10Mail: Remove yana@ from legal-tm-vio@wikimedia.org - https://phabricator.wikimedia.org/T132230#2196179 (10Krenair) I think @Andrew is the person dealing with these tickets this week (?) Unless the underlying account is being closed, it might be a good idea to find out exactly which aliases po... [18:37:17] 06Operations, 06Performance-Team, 05codfw-rollout, 03codfw-rollout-Jan-Mar-2016: Figure out how to migrate the jobqueues - https://phabricator.wikimedia.org/T124673#2196249 (10aaron) [18:37:20] 06Operations, 06Performance-Team, 05codfw-rollout, 03codfw-rollout-Jan-Mar-2016: Ensure post-send handlers check and respect read-only-mode - https://phabricator.wikimedia.org/T129250#2196244 (10aaron) 05Open>03Resolved a:03aaron [18:37:30] 06Operations, 10media-storage: Unable to delete, restore/undelete, move or upload new versions of files on several wikis ("inconsistent state within the internal storage backends") - https://phabricator.wikimedia.org/T128096#2196251 (10NahidSultan) I don't know if it's relevant here but this [[ https://upload.... [18:37:55] 06Operations, 10Analytics: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#2196265 (10Dzahn) p:05Low>03Normal could we raise the prio slightly to normal? are there other services here besides Apache? 
[18:45:21] (03PS1) 10EBernhardson: Specify specific version of elasticsearch package [puppet] - 10https://gerrit.wikimedia.org/r/282743 (https://phabricator.wikimedia.org/T132376) [18:47:33] 06Operations, 07Icinga: upgrade neon (icinga) to jessie - https://phabricator.wikimedia.org/T125023#2196302 (10Dzahn) @Akosiaris would it be helpful if i'd request a separate misc. machine and put the neon roles on a jessie machine, (while not touching neon), then see if we can replace it? [18:47:59] mark, hi, any updates on hardware? [18:48:07] yurik: quotes just came in, but not complete yet [18:48:41] ebernhardson, thx! I presume that's the backends, right? Any updates on varnish? [18:49:13] 06Operations, 07Blocked-on-RelEng: Phase out antimony.wikimedia.org - https://phabricator.wikimedia.org/T123718#2196309 (10Dzahn) p:05Triage>03Normal [18:49:28] 06Operations, 07Blocked-on-RelEng: Phase out antimony.wikimedia.org - https://phabricator.wikimedia.org/T123718#1936415 (10Dzahn) 05Open>03stalled [18:49:30] 06Operations, 13Patch-For-Review, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2196311 (10Dzahn) [18:49:38] yurik: still waiting on mark about varnish. The budget to get 4 replacements next FY looks to be progressing, but mark has to figure out about the other 12 [18:50:01] 06Operations, 07Blocked-on-RelEng: Phase out antimony.wikimedia.org - https://phabricator.wikimedia.org/T123718#1936415 (10Dzahn) This is stalled as long as we use gitblit. (Or we just upgrade it in place anyways, so we don't wait for that) [18:50:07] ebernhardson, gotcha, thx for the update! 
[18:50:54] 06Operations, 07Blocked-on-RelEng, 05Gitblit-Deprecate: Phase out antimony.wikimedia.org - https://phabricator.wikimedia.org/T123718#2196329 (10Dzahn) [18:51:30] (03CR) 10Boshomi: [C: 04-1] "I am a user with a lot of small edits in very short time: My normal workflow:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280002 (owner: 10Jforrester) [18:52:11] (03CR) 10Tjones: "I added a comment which I think has the right syntax to list the relevant languages for enwiki." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/268048 (https://phabricator.wikimedia.org/T121542) (owner: 10EBernhardson) [18:53:39] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [5000000.0] [18:54:08] 06Operations, 07Blocked-on-RelEng, 05Gitblit-Deprecate: Phase out antimony.wikimedia.org - https://phabricator.wikimedia.org/T123718#1936415 (10Dzahn) ... or how about just putting gitblit on a jessie VM and shutting this server down? shouldn't have a hard blocker.hm? 
[18:54:31] Operations, Blocked-on-RelEng, Gitblit-Deprecate: Phase out antimony.wikimedia.org (git.wikimedia.org / gitblit) - https://phabricator.wikimedia.org/T123718#2196342 (Dzahn) [18:56:09] (CR) Boshomi: "s/:en:Benutzer:TMg/weblinkChecker/:de:Benutzer:TMg/weblinkChecker" [mediawiki-config] - https://gerrit.wikimedia.org/r/280002 (owner: Jforrester) [19:00:17] (PS3) BBlack: misc large objects refactor [puppet] - https://gerrit.wikimedia.org/r/282716 (https://phabricator.wikimedia.org/T128813) [19:01:42] !log ori@tin Synchronized php-1.27.0-wmf.20/extensions/WikimediaEvents: I672624e9fc30: Collect impact of proposed ResourceLoader feature-test in statsd (duration: 00m 34s) [19:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:01:48] ^ Krinkle [19:02:31] ori: thx [19:03:53] Operations, MediaWiki-Parser, Parsing-Team, Traffic, and 4 others: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2196375 (Jdlrobson) [19:03:59] (CR) MGChecker: [C: -1] "This would be blocking much of productive work." [mediawiki-config] - https://gerrit.wikimedia.org/r/280002 (owner: Jforrester) [19:04:02] Operations, MediaWiki-Parser, Parsing-Team, Traffic, and 4 others: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#1870420 (Jdlrobson) I've spent hours on this and although I cannot replicate I would like to try https://gerrit.wikimedia.org/... [19:06:32] Operations, Blocked-on-RelEng, Gitblit-Deprecate: Phase out antimony.wikimedia.org (git.wikimedia.org / gitblit) - https://phabricator.wikimedia.org/T123718#2196410 (demon) >>! In T123718#2196331, @Dzahn wrote: > ... or how about just putting gitblit on a jessie VM and shutting this server down? shou... 
[19:07:17] (CR) Smalyshev: A/B/C test of control vs textcat vs accept-lang + textcat (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/268048 (https://phabricator.wikimedia.org/T121542) (owner: EBernhardson) [19:13:22] (PS1) Dzahn: fix es2018 mgmt IP address [dns] - https://gerrit.wikimedia.org/r/282748 [19:14:25] (PS2) Dzahn: fix es2018 mgmt IP address [dns] - https://gerrit.wikimedia.org/r/282748 [19:15:06] (CR) Dzahn: [C: 2] "[iron:~] $ host es2018.mgmt.codfw.wmnet" [dns] - https://gerrit.wikimedia.org/r/282748 (owner: Dzahn) [19:16:44] (CR) Dzahn: "i found one IP already used in the new network 10.193.3.0 (fixed in https://gerrit.wikimedia.org/r/#/c/282748/), needs rebase on that. do" [dns] - https://gerrit.wikimedia.org/r/281449 (https://phabricator.wikimedia.org/T130941) (owner: Papaul) [19:17:08] PROBLEM - puppet last run on mw1115 is CRITICAL: CRITICAL: Puppet has 2 failures [19:20:10] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3852 MB (10% inode=93%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 543373 MB (37% inode=99%) [19:20:46] manpages.ubuntu.com - unable to connect .. come on [19:20:54] google! [19:22:17] first #ubuntu :p [19:25:08] PROBLEM - check_disk on lutetium is CRITICAL: DISK CRITICAL - free space: / 3463 MB (9% inode=93%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 542021 MB (37% inode=99%) [19:25:59] 12:30 < MonkeyDust> mutante i guess the site is being updated, due to the upcoming xenial release [19:26:22] mutante: Hmm, where could I get traffic data on something that's proxied from misc-web lvs? [19:27:50] ostriches: hmm.. 
i think Hive https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Queries#Pageviews [19:28:32] I suppose A) I need hive :) [19:28:40] and B) is non-project data captured there too? :) [19:29:21] a) it should be stat1002 access b) yea, should. i remember checking it for "m.login" [19:29:28] https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive#Cluster_access [19:30:08] RECOVERY - check_disk on lutetium is OK: DISK OK - free space: / 25492 MB (71% inode=93%): /dev 32200 MB (99% inode=99%): /run 6403 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 32209 MB (100% inode=99%): /srv 540676 MB (37% inode=99%) [19:34:12] (03PS1) 10Chad: Gitblit: remove apache crud [puppet] - 10https://gerrit.wikimedia.org/r/282750 [19:34:41] mutante: Btw, I'm fine with moving gitblit to a VM and off of bare metal. ^ should make it even easier [19:34:45] Less dependencies. [19:34:54] cool, sounds good :) [19:35:14] i'll just add one with jessie [19:35:48] ostriches: apache logs on the backend dont tell us? [19:36:07] Nope, lvs connects directly to gitblit on 8080 [19:36:14] ah, right [19:36:16] Apache's just been running there on 80 with nothing pointing to it in ages. [19:36:19] hehe [19:36:23] :p ok [19:36:56] looks at the gerrit change [19:37:34] nice, and one SSL setup less, also good [19:38:12] * fewer. 
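A note on the exchange above: since LVS connects straight to gitblit on 8080 and nothing points at Apache on 80, a reachability probe only says something useful if it targets the port that actually serves traffic. Below is a minimal sketch of such a probe, assuming a plain TCP connect is an acceptable stand-in for a real health check. The function name `port_is_open` is made up for illustration and is not the Icinga check used in production.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; not the production Icinga check.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Under these assumptions, probing port 8080 on the gitblit host would report on the service LVS actually talks to, whereas probing port 80 would only report on the idle Apache.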
[19:38:13] ;-) [19:40:28] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [19:41:10] !log antimony (gitblit) - stop Apache [19:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:41:44] (03PS2) 10Dzahn: Gitblit: remove apache crud [puppet] - 10https://gerrit.wikimedia.org/r/282750 (owner: 10Chad) [19:41:52] (03CR) 10Dzahn: [C: 032] Gitblit: remove apache crud [puppet] - 10https://gerrit.wikimedia.org/r/282750 (owner: 10Chad) [19:42:33] Well apache is dead and git.wm.o still loads so my theory was right :P [19:42:36] 06Operations, 10ops-eqiad, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: wdqs1002 does not reboot, stops at "Scanning for devices" - https://phabricator.wikimedia.org/T132387#2196527 (10Gehel) [19:44:21] 06Operations, 10ops-eqiad, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: wdqs1002 does not reboot, stops at "Scanning for devices" - https://phabricator.wikimedia.org/T132387#2196542 (10Gehel) [19:44:48] PROBLEM - HHVM rendering on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:58] PROBLEM - git.wikimedia.org on antimony is CRITICAL: Connection refused [19:45:00] PROBLEM - Apache HTTP on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:45:11] haha, i knew it icinga [19:46:08] PROBLEM - RAID on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:46:10] PROBLEM - salt-minion processes on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:46:10] PROBLEM - HHVM processes on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:46:18] PROBLEM - SSH on mw1115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:46:29] Ph pshaw. [19:46:38] PROBLEM - configured eth on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:47:19] PROBLEM - Check size of conntrack table on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:47:19] PROBLEM - Disk space on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:47:20] PROBLEM - nutcracker process on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:47:30] PROBLEM - DPKG on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:47:45] looks at mw1115 [19:47:50] PROBLEM - dhclient process on mw1115 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:48:06] still running [19:48:14] (03CR) 10Volans: [C: 032] Depool db1049 to deploy Pupper certs for TLS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282732 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [19:48:41] (03Merged) 10jenkins-bot: Depool db1049 to deploy Pupper certs for TLS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282732 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [19:49:42] ostriches: that monitoring check was probably always just on the Apache itself [19:50:11] Yeah apache on the host, which isn't what we care about anymore [19:50:17] yep [19:50:35] lvs should already complain if it can't hit the port, right? [19:51:49] PROBLEM - Disk space on elastic1013 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80325 MB (15% inode=99%) [19:51:57] hmmm. 
there is no check for that though [19:52:03] not in icinga [19:52:18] PROBLEM - nutcracker port on mw1115 is CRITICAL: Timeout while attempting connection [19:52:23] there is for things that are foo.svc.eqiad.wmnet [19:52:33] !log powercycled mw1115 [19:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:52:39] i can look at fixing that check [19:52:47] gotta move it to a virtual host [19:53:07] !log volans@tin Synchronized wmf-config/db-eqiad.php: Depool db1049 to deploy Pupper certs for TLS - T111654 (duration: 02m 25s) [19:53:08] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [19:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:53:14] because if it's a service on antimony, it will use antimony's IP [19:53:18] RECOVERY - dhclient process on mw1115 is OK: PROCS OK: 0 processes with command name dhclient [19:53:30] RECOVERY - RAID on mw1115 is OK: OK: no RAID installed [19:53:38] RECOVERY - HHVM processes on mw1115 is OK: PROCS OK: 6 processes with command name hhvm [19:53:38] RECOVERY - salt-minion processes on mw1115 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:53:39] RECOVERY - SSH on mw1115 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [19:53:59] RECOVERY - configured eth on mw1115 is OK: OK - interfaces up [19:53:59] RECOVERY - nutcracker port on mw1115 is OK: TCP OK - 0.000 second response time on port 11212 [19:54:19] RECOVERY - Apache HTTP on mw1115 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 1.304 second response time [19:54:21] mutante: with my luck mw1115 crashed just while running sync-file... any quick way to run it again only on this host? 
[19:54:39] RECOVERY - Disk space on mw1115 is OK: DISK OK [19:54:39] RECOVERY - Check size of conntrack table on mw1115 is OK: OK: nf_conntrack is 7 % full [19:54:48] RECOVERY - nutcracker process on mw1115 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [19:54:49] RECOVERY - DPKG on mw1115 is OK: All packages OK [19:54:58] PROBLEM - Host wdqs1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:55:25] !log antimony - remove Apache package and config [19:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:55:58] RECOVERY - HHVM rendering on mw1115 is OK: HTTP OK: HTTP/1.1 200 OK - 73877 bytes in 0.152 second response time [19:56:13] volans: not sure, i haven't deployed mw in a reaaallly long time [19:56:22] it used to be that you could edit the dsh file [19:56:34] ok running it again, it's just one file should not hurt [19:56:46] ok, good [19:57:19] !log volans@tin Synchronized wmf-config/db-eqiad.php: Depool db1049 to deploy Pupper certs for TLS - T111654 (duration: 00m 28s) [19:57:19] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [19:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:57:28] done, successfully [19:59:01] :) [19:59:38] RECOVERY - Disk space on elastic1013 is OK: DISK OK [20:00:03] !log antimony - shred and delete git.wikimedia.org SSL key [20:00:04] gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160411T2000). 
[20:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:01:56] 06Operations, 07Blocked-on-RelEng, 05Gitblit-Deprecate: Phase out antimony.wikimedia.org (git.wikimedia.org / gitblit) - https://phabricator.wikimedia.org/T123718#2196572 (10Dzahn) We removed Apache and config and the SSL key from antimony... Gotta fix the Icinga monitoring check next, since it checked Apa... [20:03:11] !log Deploy and use Puppet certs for TLS on cross-dc replica for shard s5 T111654 [20:03:12] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [20:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:05:16] !log temporarily changing elasticsearch high watermark to 75% to rebalance cluster [20:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:06:26] Krenair: "it's unclear who should do it, can we create a project?" "only if we list who has access" "well, somebody needs to compile a list" "who can do that?" "anyone with access" ?:) [20:07:07] mutante, you have access don't you? [20:07:37] I'm looking for someone will agree to be assigned the first task, then we can open the project for it [20:08:07] besides the whole point of opening that ticket was that it's not an ops thing [20:09:17] !log synced code; restarted Parsoid on wtp1001.eqiad as a canary [20:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:09:25] i asked for a project because i didnt know the answer to that question [20:09:37] (03PS3) 10Volans: MariaDB: use Puppet certs for s5 cross-dc replica [puppet] - 10https://gerrit.wikimedia.org/r/282684 (https://phabricator.wikimedia.org/T111654) [20:09:54] and i didnt want to remove operations because that leaves tickets without project [20:11:28] (03CR) 10Papaul: "mgmt network is /16. so we have 10.193.0.0/16. 
I think the goal of choosing /16 was to have more hosts and less number of subnets. with /2" [dns] - 10https://gerrit.wikimedia.org/r/281449 (https://phabricator.wikimedia.org/T130941) (owner: 10Papaul) [20:11:43] (03CR) 10Volans: [C: 032] MariaDB: use Puppet certs for s5 cross-dc replica [puppet] - 10https://gerrit.wikimedia.org/r/282684 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [20:13:55] mutante: how was the quick way to force the merge on strontium when it fails? [20:14:49] volans: cd /var/lib/git/operations/puppet and git pull origin [20:14:55] on strontium directlu [20:14:59] yea [20:15:01] s/u/y/ [20:15:13] or you can repeat puppet-merge [20:15:26] tried but on palladium nothing to merge :D [20:15:41] yea, then try the former [20:16:03] thanks, it pulled the change, shuld work [20:16:43] (03CR) 10Dzahn: "aha! in that case i'm not sure if "$ORIGIN 3.193.{{ zonename }}." should be added like that" [dns] - 10https://gerrit.wikimedia.org/r/281449 (https://phabricator.wikimedia.org/T130941) (owner: 10Papaul) [20:16:47] cool [20:19:03] !log updated Parsoid to version e3766b79 [20:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:20:00] (03CR) 10Gehel: "The changes to hiera are done for the elasticsearch role, but not for logstash. Which means that we pin the elasticsearch version for the " [puppet] - 10https://gerrit.wikimedia.org/r/282743 (https://phabricator.wikimedia.org/T132376) (owner: 10EBernhardson) [20:21:19] (03PS1) 10Chad: gitblit: don't monitor if apache is running anymore [puppet] - 10https://gerrit.wikimedia.org/r/282758 [20:22:04] mutante: ^ :) [20:23:43] (03PS1) 10Rush: pam_limits tools bastion parameters [puppet] - 10https://gerrit.wikimedia.org/r/282759 (https://phabricator.wikimedia.org/T131541) [20:24:58] ostriches: let me fix the check instead of removing it? 
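The manual catch-up volans and mutante settle on earlier in this stretch (puppet-merge has nothing left to merge on palladium, but strontium is behind) amounts to running `git pull origin` inside the puppet checkout on strontium. Below is a rough sketch of that step, assuming the path quoted in the channel. `pull_puppet_repo` and its `runner` parameter are invented for illustration and are not part of puppet-merge.

```python
import subprocess
from typing import Callable, List

# Path quoted in the channel for the puppet checkout on strontium.
PUPPET_REPO = "/var/lib/git/operations/puppet"


def pull_puppet_repo(
    repo: str = PUPPET_REPO,
    runner: Callable[..., object] = subprocess.run,
) -> List[str]:
    """Run `git pull origin` inside repo and return the command that was run.

    Hypothetical helper: the runner is injectable so the call can be
    dry-run without touching a real checkout.
    """
    cmd = ["git", "pull", "origin"]
    # cwd=repo is the equivalent of `cd <repo>` before the pull;
    # check=True makes subprocess.run raise if git exits non-zero.
    runner(cmd, cwd=repo, check=True)
    return cmd
```

Passing a stub as `runner` makes it possible to confirm the command and working directory before pointing it at the real checkout.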
[20:25:27] well, that is correct either way [20:25:40] it would have to be added in a different place [20:26:32] Krenair: i checked who has access and added a comment (it's separate for each domain though) [20:27:29] Operations, Analytics, MediaWiki-extensions-CentralNotice, Traffic: Generate a list of junk CN cookies being sent by clients - https://phabricator.wikimedia.org/T132374#2196672 (AndyRussG) [20:27:40] Operations, Analytics, MediaWiki-extensions-CentralNotice, Traffic: Generate a list of junk CN cookies being sent by clients - https://phabricator.wikimedia.org/T132374#2196113 (AndyRussG) Thanks!! :) [20:27:54] (PS2) RobH: DNS: Adding mgmt DNS for spare pool servers Bug: T130941 [dns] - https://gerrit.wikimedia.org/r/281449 (https://phabricator.wikimedia.org/T130941) (owner: Papaul) [20:28:14] mutante: That works too :) [20:28:40] (PS2) Dzahn: gitblit: don't monitor if apache is running anymore [puppet] - https://gerrit.wikimedia.org/r/282758 (https://phabricator.wikimedia.org/T123718) (owner: Chad) [20:28:43] (PS1) Chad: Gerrit: replicate git repositories to new home [puppet] - https://gerrit.wikimedia.org/r/282761 [20:29:00] (CR) RobH: [C: 2] DNS: Adding mgmt DNS for spare pool servers Bug: T130941 [dns] - https://gerrit.wikimedia.org/r/281449 (https://phabricator.wikimedia.org/T130941) (owner: Papaul) [20:30:20] (CR) jenkins-bot: [V: -1] Gerrit: replicate git repositories to new home [puppet] - https://gerrit.wikimedia.org/r/282761 (owner: Chad) [20:31:20] (CR) Dzahn: [C: 2] "gonna add this on neon, if we put this in the role it will always be applied on antimony and use that IP, instead we need to add a virtual" [puppet] - https://gerrit.wikimedia.org/r/282758 (https://phabricator.wikimedia.org/T123718) (owner: Chad) [20:32:40] (PS1) Volans: Repool db1049 for vslow and dump [mediawiki-config] - https://gerrit.wikimedia.org/r/282764 (https://phabricator.wikimedia.org/T111654) 
[20:33:08] (PS1) Bartosz Dziewoński: Remove $wgApiFrameOptions override for UploadWizard wikis, no longer needed [mediawiki-config] - https://gerrit.wikimedia.org/r/282768 (https://phabricator.wikimedia.org/T131182) [20:33:19] Operations, Discovery, Traffic, Wikidata, and 2 others: Empty result on a tree query - https://phabricator.wikimedia.org/T127014#2196697 (BBlack) Open>Resolved a:BBlack We ended up solving this in reverse order. The betwee_bytes_timeout values are already raised for varnish<->varnish... [20:34:21] ACKNOWLEDGEMENT - Host wdqs1002 is DOWN: PING CRITICAL - Packet loss = 100% Gehel https://phabricator.wikimedia.org/T132387 [20:34:50] Operations, Discovery, Traffic, Wikidata, and 2 others: Empty result on a tree query - https://phabricator.wikimedia.org/T127014#2196704 (Smalyshev) Confirmed, the query runs fine now, thanks @BBlack! [20:39:35] (PS1) Dzahn: gitblit/icinga: move git.wm.org monitoring to virtual host [puppet] - https://gerrit.wikimedia.org/r/282796 (https://phabricator.wikimedia.org/T123718) [20:39:53] (PS2) Rush: pam_limits tools bastion parameters [puppet] - https://gerrit.wikimedia.org/r/282759 (https://phabricator.wikimedia.org/T131541) [20:40:16] (PS2) Dzahn: gitblit/icinga: move git.wm.org monitoring to virtual host [puppet] - https://gerrit.wikimedia.org/r/282796 (https://phabricator.wikimedia.org/T123718) [20:40:47] (PS3) Dzahn: gitblit/icinga: move git.wm.org monitoring to virtual host [puppet] - https://gerrit.wikimedia.org/r/282796 (https://phabricator.wikimedia.org/T123718) [20:40:51] (CR) Rush: [C: 2 V: 2] pam_limits tools bastion parameters [puppet] - https://gerrit.wikimedia.org/r/282759 (https://phabricator.wikimedia.org/T131541) (owner: Rush) [20:41:22] volans, if you have a chance, could you look at https://gerrit.wikimedia.org/r/282440 ? 
It's part of a project to migrate Flow content into its own External Store, to unblock the External Store recompression. [20:41:43] !log mwscript deleteEqualMessages.php --wiki plwiki (T45917) [20:41:44] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917 [20:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:42:02] (PS4) Dzahn: gitblit/icinga: move git.wm.org monitoring to virtual host [puppet] - https://gerrit.wikimedia.org/r/282796 (https://phabricator.wikimedia.org/T123718) [20:43:01] (CR) Dzahn: [C: 2] gitblit/icinga: move git.wm.org monitoring to virtual host [puppet] - https://gerrit.wikimedia.org/r/282796 (https://phabricator.wikimedia.org/T123718) (owner: Dzahn) [20:43:02] !log starting mobileapps deploy [20:43:03] (CR) jenkins-bot: [V: -1] gitblit/icinga: move git.wm.org monitoring to virtual host [puppet] - https://gerrit.wikimedia.org/r/282796 (https://phabricator.wikimedia.org/T123718) (owner: Dzahn) [20:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:43:14] (CR) CSteipp: [C: 2] Enable Ex:OATHAuth in beta, disabled for all users [mediawiki-config] - https://gerrit.wikimedia.org/r/282198 (https://phabricator.wikimedia.org/T131420) (owner: CSteipp) [20:43:32] matt_flaschen: I can, but I have no context on it [20:44:30] (PS6) Dzahn: gitblit/icinga: move git.wm.org monitoring to virtual host [puppet] - https://gerrit.wikimedia.org/r/282796 (https://phabricator.wikimedia.org/T123718) [20:44:45] volans, are you familiar with External Store in general? 
[20:44:53] (CR) jenkins-bot: [V: -1] gitblit/icinga: move git.wm.org monitoring to virtual host [puppet] - https://gerrit.wikimedia.org/r/282796 (https://phabricator.wikimedia.org/T123718) (owner: Dzahn) [20:45:06] (PS7) Dzahn: gitblit/icinga: move git.wm.org monitoring to virtual host [puppet] - https://gerrit.wikimedia.org/r/282796 (https://phabricator.wikimedia.org/T123718) [20:45:32] (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/282796 (https://phabricator.wikimedia.org/T123718) (owner: Dzahn) [20:46:22] matt_flaschen: only from the DB point of view (es1, es2, es3), not much from the application point of view [20:47:10] (Merged) jenkins-bot: Enable Ex:OATHAuth in beta, disabled for all users [mediawiki-config] - https://gerrit.wikimedia.org/r/282198 (https://phabricator.wikimedia.org/T131420) (owner: CSteipp) [20:47:33] (CR) Dzahn: [C: 2] gitblit/icinga: move git.wm.org monitoring to virtual host [puppet] - https://gerrit.wikimedia.org/r/282796 (https://phabricator.wikimedia.org/T123718) (owner: Dzahn) [20:47:39] volans, okay, well, as you probably know the main point is to get the text outside the main text tables, and have better compression. It also has functionality to recompress these external tables, and delete unreferenced content. [20:48:07] volans, Flow does not record its uses in the text table, so it would be treated as unreferenced. There was a discussion about the best way to solve this, and we decided to give Flow its own External Store cluster. [20:48:24] (CR) Dzahn: "could we please merge without having to recheck 3 times?" [puppet] - https://gerrit.wikimedia.org/r/282796 (https://phabricator.wikimedia.org/T123718) (owner: Dzahn) [20:48:43] mutante: what box is the mail relay now? Presumably no longer polonium... 
[20:48:45] (03CR) 10Dzahn: "what do you mean "needs verified", you JUST DID" [puppet] - 10https://gerrit.wikimedia.org/r/282796 (https://phabricator.wikimedia.org/T123718) (owner: 10Dzahn) [20:48:56] (this is regarding https://wikitech.wikimedia.org/wiki/Mail#Modify_aliases ) [20:49:58] andrewbogott: eh, yea, that's really old. we dont manually hack configs anymore since years [20:50:06] andrewbogott: this instead https://wikitech.wikimedia.org/wiki/Ops_Clinic_Duty#Mail_aliases [20:50:20] thanks [20:50:30] on palladium, go to /root/private/modules/privateexim/ [20:50:37] make changes, git commit [20:50:41] We need a little tickbox on each wikitech page "SEARCH, never show me this page again" [20:50:48] haha, yes [20:51:25] volans, the master task for this is https://phabricator.wikimedia.org/T106363 . [20:51:57] andrewbogott: thanks for taking it [20:52:12] volans, as part of this, we're in the process of setting up External Store on Beta (before it wasn't used at all for External Store or Flow). Then after it's set up for regular and Flow, we'll set up a new Beta cluster and do a dry run migration. Then the real migration on Beta, and finally prod. [20:52:24] See https://phabricator.wikimedia.org/T119568 , etc. [20:52:35] (03PS1) 10Bartosz Dziewoński: Enable $wgAbuseFilterProfile for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282806 (https://phabricator.wikimedia.org/T132200) [20:53:01] (03CR) 10Bartosz Dziewoński: [C: 04-1] "To be merged on Wednesday after the train deployment." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/282806 (https://phabricator.wikimedia.org/T132200) (owner: 10Bartosz Dziewoński) [20:53:33] !log csteipp@tin Synchronized wmf-config/InitialiseSettings-labs.php: retry oath in labs (duration: 00m 32s) [20:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:54:56] 06Operations, 10Mail: Remove yana@ from legal-tm-vio@wikimedia.org - https://phabricator.wikimedia.org/T132230#2196829 (10Andrew) 05Open>03Resolved a:03Andrew I'm removing Yana from legal-tm-vio right now. I don't see her username anywhere else in the alias file, so this may be the end of it. [20:55:05] !log csteipp@tin Synchronized wmf-config/CommonSettings-labs.php: retry oath in labs (duration: 00m 29s) [20:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:58:14] robh: are you still planning on sorting out https://phabricator.wikimedia.org/T131252 or should I take it on? [20:59:29] matt_flaschen: I understand the reasoning and all of this, but I have no idea how the servers in the code review where set, so I really cannot say much about this code review [20:59:50] andrewbogott: feel free if you get to it before me [20:59:57] just make sure you document it cuz its broken now! [21:00:09] volans, okay, no problem. I'll follow up with jynus when he gets back. [21:00:12] aside, cluster1 on current production is es1 that is read-only, although for a new setup it could make sense to start from 1 again [21:00:13] robh: I can't tell if that bug now signifies "Explain things to Reedy" or "Make actual change in access" [21:00:21] both! [21:00:40] the explain to Reedy is easy. Reedy: when we give you librenms access, dont go publishing it all without oversight ;D [21:00:41] Do you even 3-phase, brah? [21:00:51] and how 3 phase works i can email [21:00:52] =] [21:01:16] Reedy: or we already discussed? i dont recall at this point. 
[21:01:24] Nope, we've not [21:01:25] I've explained our power infrastructure a lot in the past month. [21:01:36] heh, ok, cool. So lemme find the links. [21:01:45] But I did know phase3 (lol) isn't exactly the same [21:01:49] actually, mid task in other channel, lemme finish that and brb [21:01:57] !log mobileapps deployed 6ef3054 [21:01:57] ok, I will grant the access but don't want to be responsible for misinforming Reedy :) [21:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:02:14] andrewbogott: Can you actually grant access? I think many have tried and failed :P [21:02:20] So, Reedy, the power grid is a series of tubes. [21:02:37] andrewbogott: indeed the access grant is broken [21:02:42] for both adding sam and the new opsen [21:02:44] thats the real blocker. [21:02:49] :( ok [21:02:54] (03PS1) 10CSteipp: Revert "Enable Ex:OATHAuth in beta, disabled for all users" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282808 [21:03:38] (03CR) 10CSteipp: [C: 032] Revert "Enable Ex:OATHAuth in beta, disabled for all users" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282808 (owner: 10CSteipp) [21:03:45] So someone needs to dig into it and find out wtf is going on. [21:04:04] (03Merged) 10jenkins-bot: Revert "Enable Ex:OATHAuth in beta, disabled for all users" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282808 (owner: 10CSteipp) [21:04:40] php sucks? 
[21:04:48] * andrewbogott puts on chute, dives into rabbithole [21:06:23] !log csteipp@tin Synchronized wmf-config/InitialiseSettings-labs.php: reverting oath (duration: 00m 29s) [21:06:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:07:01] #rabbithole should be a phab tag :) [21:07:20] !log csteipp@tin Synchronized wmf-config/CommonSettings-labs.php: revert oath in labs (duration: 00m 27s) [21:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:12:19] ostriches: virtual server git.wikimedia.org with service of the same name on it: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=git.wikimedia.org&nostatusheader [21:12:42] 06Operations, 10ops-eqiad, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: wdqs1002 does not reboot, stops at "Scanning for devices" - https://phabricator.wikimedia.org/T132387#2196527 (10RobH) I've disabled the lifecycle controller completely, so its not the issue. After more testing, I think the in... [21:13:26] 06Operations, 07Blocked-on-RelEng, 05Gitblit-Deprecate, 13Patch-For-Review: Phase out antimony.wikimedia.org (git.wikimedia.org / gitblit) - https://phabricator.wikimedia.org/T123718#2196931 (10Dzahn) >>! In T123718#2196572, @Dzahn wrote: > We removed Apache and config and the SSL key from antimony... >... [21:17:27] ostriches: and there is also no reason left we still need to have a public IP for the gitblit host .. i assume [21:21:45] robh: can /you/ log in to https://librenms.wikimedia.org/ ? [21:21:57] yes [21:22:27] !log restore elasticsearch cluster high disk watermark to 90% [21:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:23:21] (03CR) 10Jforrester: [C: 031] "The relevant code is now live everywhere, so this should be good to go immediately." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/282768 (https://phabricator.wikimedia.org/T131182) (owner: 10Bartosz Dziewoński) [21:29:42] !log mwscript deleteEqualMessages.php --wiki frwikibooks (T45917) [21:29:43] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917 [21:29:44] !log mwscript deleteEqualMessages.php --wiki bgwikiquote (T45917) [21:29:45] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917 [21:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:31:40] (03PS1) 10Dzahn: site/install/dhcp: add furud.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/282814 (https://phabricator.wikimedia.org/T123718) [21:32:57] (03PS2) 10Dzahn: site/install/dhcp: add furud.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/282814 (https://phabricator.wikimedia.org/T123718) [21:34:48] robh: will anyone care/notice if I break librenms logins for a while? [21:35:22] the users i know of are our network admins and onsite folks [21:35:34] so as long as papaul or cmjohnson1 arent actively using it this second i think its ok. [21:35:47] go for it! [21:40:40] (03PS1) 10Dzahn: introduce furud.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/282816 (https://phabricator.wikimedia.org/T123718) [21:41:20] (03PS2) 10Dzahn: introduce furud.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/282816 (https://phabricator.wikimedia.org/T123718) [21:42:36] (03CR) 10Dzahn: [C: 032] introduce furud.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/282816 (https://phabricator.wikimedia.org/T123718) (owner: 10Dzahn) [21:48:37] oh, eh.. gitblit really uses like 8GB RAM ? 
[21:50:00] top says VIRT 11.7g RES 4.7G for the java process but the commandline has -Xmx8g so that would be max [21:51:42] 06Operations, 10ops-eqiad, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: wdqs1002 does not reboot, stops at "Scanning for devices" - https://phabricator.wikimedia.org/T132387#2197043 (10Smalyshev) If it's getting reinstalled, can it be also linked to T120714 - i.e. extending diskspace and making cor... [21:51:58] 06Operations, 10ops-eqiad, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: wdqs1002 does not reboot, stops at "Scanning for devices" - https://phabricator.wikimedia.org/T132387#2197045 (10RobH) Please note: T120714 This should have an entirely new partman recipe designated to accomodate the new disk... [21:52:01] and how do you know how much space is left to assign on the ganeti host? [22:03:48] !log creating virtual machine furud.codfw.wmnet [22:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:07:02] (03PS2) 10Volans: Repool db1049 for vslow and dump [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282764 (https://phabricator.wikimedia.org/T111654) [22:07:23] !log ori@tin Synchronized php-1.27.0-wmf.20/includes/page/WikiPage.php: Ie9799f5ea: Increase triggerOpportunisticLinksUpdate() backoff TTL (duration: 00m 34s) [22:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:09:26] (03CR) 10Volans: [C: 032] Repool db1049 for vslow and dump [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282764 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [22:09:53] (03Merged) 10jenkins-bot: Repool db1049 for vslow and dump [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282764 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [22:11:40] !log volans@tin Synchronized wmf-config/db-eqiad.php: Repool db1049 after deploy of Puppet certs for TLS - T111654 (duration: 00m 30s) [22:11:41] T111654: Set up TLS for MariaDB 
replication - https://phabricator.wikimedia.org/T111654 [22:13:42] else { [22:13:43] // FIXME return a warning that LDAP couldn't connect? [22:13:43] } [22:13:47] Grrrrrrrrrrrrrrrrr [22:14:23] god forbid you should report errors, or even have a logging framework [22:17:30] what code is this from andrewbogott? :) [22:19:02] librenms [22:27:05] !log Ran <get( 'refreshLinksDynamic' )->delete();>> on commonswiki [22:32:59] (03PS3) 10Dzahn: site/install/dhcp: add furud.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/282814 (https://phabricator.wikimedia.org/T123718) [22:37:18] (03CR) 10Dzahn: [C: 032] site/install/dhcp: add furud.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/282814 (https://phabricator.wikimedia.org/T123718) (owner: 10Dzahn) [22:40:41] 06Operations, 10DBA, 06Performance-Team, 13Patch-For-Review: Job Queue growing and then running a lot of jobs at once on commonswiki - https://phabricator.wikimedia.org/T132318#2197371 (10ori) 05Open>03Resolved a:03aaron * b543b9bf11afd4 should reduce the rate at which refreshLinksDynamic are enqueue... [22:47:06] (03PS2) 10BBlack: Do not redirect Samsung Smart TV 2015 and newer to mobile [puppet] - 10https://gerrit.wikimedia.org/r/281031 (https://phabricator.wikimedia.org/T127021) (owner: 10Bmansurov) [22:47:13] (03CR) 10BBlack: [C: 032 V: 032] Do not redirect Samsung Smart TV 2015 and newer to mobile [puppet] - 10https://gerrit.wikimedia.org/r/281031 (https://phabricator.wikimedia.org/T127021) (owner: 10Bmansurov) [22:49:45] (03CR) 10Bmansurov: "To reply to Phuedx' earlier comment, I was not able to test either. I didn't know how to test." [puppet] - 10https://gerrit.wikimedia.org/r/281031 (https://phabricator.wikimedia.org/T127021) (owner: 10Bmansurov) [22:52:56] (03CR) 10MaxSem: "Well, you update detection rules in MF and test them:) Not a real Varnish test but allows you to test the rules themselves, at least." 
[puppet] - 10https://gerrit.wikimedia.org/r/281031 (https://phabricator.wikimedia.org/T127021) (owner: 10Bmansurov) [23:00:04] RoanKattouw ostriches Krenair MaxSem Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160411T2300). [23:00:04] MatmaRex Dereckson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:09] hey [23:00:11] Hello. [23:00:12] I have a patch to add [23:00:27] Feel free, we only have 5/8 [23:00:37] https://gerrit.wikimedia.org/r/#/c/282827/ [23:01:33] MatmaRex: around? [23:02:08] (shouldn't be far away, we discussed 20 minutes ago) [23:03:53] added to the calendar [23:04:06] k [23:04:11] yes [23:04:31] (in a meeting) [23:04:36] (but my patch should be a no-op anyway) [23:04:37] MatmaRex: by the way, if UploadWizard is upgraded to use extension registration, we can get rid of $wg = $wmg trick [23:04:54] Dereckson: yeah, i did some work towards that recently :D [23:04:59] i think we could actually do it [23:04:59] (03PS2) 10Dereckson: Remove $wgApiFrameOptions override for UploadWizard wikis, no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282768 (https://phabricator.wikimedia.org/T131182) (owner: 10Bartosz Dziewoński) [23:05:04] anyway. 
[23:05:13] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282768 (https://phabricator.wikimedia.org/T131182) (owner: 10Bartosz Dziewoński) [23:05:47] (03Merged) 10jenkins-bot: Remove $wgApiFrameOptions override for UploadWizard wikis, no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282768 (https://phabricator.wikimedia.org/T131182) (owner: 10Bartosz Dziewoński) [23:09:22] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Remove $wgApiFrameOptions = SAMEORIGIN override for UploadWizard wikis (Task T131182, [[Gerrit:282768]]) (duration: 00m 27s) [23:09:23] T131182: Remove $wgApiFrameOptions = 'SAMEORIGIN' override for UploadWizard wikis - https://phabricator.wikimedia.org/T131182 [23:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:09:41] MatmaRex: ^ [23:10:18] thanks [23:11:18] MatmaRex: I confirm $wgApiFrameOption value is still well SAMEORIGIN now on commonswiki [23:11:52] i don't really want to upload files in production to test, it should be a no-op. [23:12:13] i'll look at recent changes for a bit to see if people can still upload ;) [23:12:21] if you want $wgApiFrameOption to be to 'SAMEORIGIN', we can consider it's fine [23:12:31] !log mwscript deleteEqualMessages.php --wiki zh_classicalwiki (T45917) [23:12:32] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917 [23:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:12:47] yeah [23:13:04] (diff | hist) . . m! File:Līksna Parish, Latvia - panoramio - alinco fan (4).jpg‎; 23:12:43 . . (-25)‎ . . 
‎Kalbbes (talk | contribs | block)‎ (added Category:Līksna parish; removed {{uncategorized}} using HotCat) [rollback] [23:13:04] hmm [23:13:11] ah not an upload [23:13:29] https://commons.wikimedia.org/wiki/File:Kurgan_in_distance.JPG has been uploaded after the change :) [23:14:19] (03PS2) 10Dereckson: New logo for vec.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282396 (https://phabricator.wikimedia.org/T132185) [23:14:36] (03CR) 10Dereckson: [C: 032] New logo for vec.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282396 (https://phabricator.wikimedia.org/T132185) (owner: 10Dereckson) [23:14:44] (03CR) 10Dereckson: "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282396 (https://phabricator.wikimedia.org/T132185) (owner: 10Dereckson) [23:15:01] (03Merged) 10jenkins-bot: New logo for vec.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282396 (https://phabricator.wikimedia.org/T132185) (owner: 10Dereckson) [23:16:37] Dereckson: hmm. it should be DENY. i think you also need to sync CommonSettings.php [23:17:17] MatmaRex: oh right [23:17:29] !log dereckson@tin Synchronized static/images/project-logos/vecwiki.png: New logo for vec.wikipedia (and end of celebration logo) (Task T132185, [[Gerrit:282396]]) (duration: 00m 24s) [23:17:30] T132185: Restore normal project logo for vec.wikipedia (spent 10th anniversary) - https://phabricator.wikimedia.org/T132185 [23:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:18:05] Logo purged, 282396 tested, works fine. [23:18:08] Now resync for your patch. 
[23:19:18] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Remove $wgApiFrameOptions = SAMEORIGIN override for UploadWizard wikis (Task T131182, [[Gerrit:282768]]) (duration: 00m 31s) [23:19:18] T131182: Remove $wgApiFrameOptions = 'SAMEORIGIN' override for UploadWizard wikis - https://phabricator.wikimedia.org/T131182 [23:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:19:35] > echo $wgApiFrameOptions; [23:19:35] DENY [23:19:36] better :) [23:20:16] aha, that did it. thanks! [23:20:21] MatmaRex: and a new file uploaded: https://commons.wikimedia.org/wiki/File:%D8%A7%D9%84%D9%85%D8%B3%D8%AC%D8%AF_%D8%A7%D9%84%D8%A7%D9%82%D8%B5%D9%89_%D9%88%D8%A7%D9%84%D9%87%D9%8A%D9%83%D9%84.jpg [23:20:26] You're welcome. [23:20:28] !log mwscript deleteEqualMessages.php --wiki idwiki (T45917) [23:20:29] T45917: Delete all redundant "MediaWiki" pages for system messages - https://phabricator.wikimedia.org/T45917 [23:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:22:10] (03PS2) 10Dereckson: Set wgSemiprotectedRestrictionLevels for fr.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282469 (https://phabricator.wikimedia.org/T132248) [23:22:28] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282469 (https://phabricator.wikimedia.org/T132248) (owner: 10Dereckson) [23:22:59] (03Merged) 10jenkins-bot: Set wgSemiprotectedRestrictionLevels for fr.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282469 (https://phabricator.wikimedia.org/T132248) (owner: 10Dereckson) [23:25:20] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Set wgSemiprotectedRestrictionLevels for fr.wikipedia (Task T132248, [[Gerrit:282469]]) (duration: 00m 29s) [23:25:21] T132248: Set wgSemiprotectedRestrictionLevels for fr.wikipedia - https://phabricator.wikimedia.org/T132248 [23:25:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:25:47] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [5000000.0] [23:26:16] Works. class="mw-textarea-sprotected" [23:26:45] Krinkle: would you have access to http://commons.wikimedia.beta.wmflabs.org/wiki/Special:GWToolset ? [23:27:25] Dereckson: I do I think [23:27:28] (03PS2) 10Dereckson: Add mergehistory right to eliminator group on ja.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282055 (https://phabricator.wikimedia.org/T131751) [23:27:36] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282055 (https://phabricator.wikimedia.org/T131751) (owner: 10Dereckson) [23:27:55] Krinkle: currently, there is only 3 domains: flickr + static flickr + upload.wikimedia.org, right? [23:27:57] http://meta.wikimedia.beta.wmflabs.org/wiki/Special:CentralAuth/Krinkle says yes. [23:28:03] (03Merged) 10jenkins-bot: Add mergehistory right to eliminator group on ja.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282055 (https://phabricator.wikimedia.org/T131751) (owner: 10Dereckson) [23:28:15] (03PS1) 10Andrew Bogott: Use half-baked ldap auth for librenms [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) [23:28:22] upload.wikimedia.org [23:28:23] *.flickr.com [23:28:23] *.staticflickr.com [23:28:28] Yes [23:29:13] (03CR) 10Andrew Bogott: "This patch is tested and seems to work." 
[puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) (owner: 10Andrew Bogott) [23:29:37] (03CR) 10jenkins-bot: [V: 04-1] Use half-baked ldap auth for librenms [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) (owner: 10Andrew Bogott) [23:30:02] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Add mergehistory right to eliminator group on ja.wikipedia (Task T131751, [[Gerrit:282055]]) (duration: 00m 29s) [23:30:03] T131751: JAWP: Enable 'mergehistory' to Eliminators - https://phabricator.wikimedia.org/T131751 [23:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:30:30] Okay, 282055 works. [23:30:54] (03PS5) 10Dereckson: Fix wgCopyUploadsDomains on Commons Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282495 (https://phabricator.wikimedia.org/T132285) [23:31:05] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282495 (https://phabricator.wikimedia.org/T132285) (owner: 10Dereckson) [23:31:21] Thanks Krinkle. 
[23:31:32] (03Merged) 10jenkins-bot: Fix wgCopyUploadsDomains on Commons Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282495 (https://phabricator.wikimedia.org/T132285) (owner: 10Dereckson) [23:33:03] (03PS2) 10Andrew Bogott: Use half-baked ldap auth for librenms [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) [23:33:41] !log dereckson@tin Synchronized wmf-config/CommonSettings-labs.php: Fix wgCopyUploadsDomains on Commons Beta (Task T132285, [[Gerrit:282495]]) — 1/2 (duration: 00m 26s) [23:33:42] T132285: Broken upload-by-url whitelist on Commons Beta - https://phabricator.wikimedia.org/T132285 [23:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:34:21] !log dereckson@tin Synchronized wmf-config/InitialiseSettings-labs.php: Fix wgCopyUploadsDomains on Commons Beta (Task T132285, [[Gerrit:282495]]) — 2/2 (duration: 00m 25s) [23:34:22] T132285: Broken upload-by-url whitelist on Commons Beta - https://phabricator.wikimedia.org/T132285 [23:34:23] (03CR) 10jenkins-bot: [V: 04-1] Use half-baked ldap auth for librenms [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) (owner: 10Andrew Bogott) [23:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:34:38] Krinkle: and now you have AND flickr AND the long list of domains in prod AND upload.wikimedia.org? [23:35:03] oh wait [23:35:19] we need to wait Jenkins beta job to pickup the change [23:35:20] I see collectie.legermuseum.nl and many other domains now [23:35:26] ah perfect [23:35:34] and upload. still present? [23:35:41] And ends with: [23:35:42] *.museumvictoria.com.au [23:35:42] *.flickr.com [23:35:42] *.staticflickr.com [23:35:42] upload.wikimedia.org [23:35:51] Works so. [23:35:53] Thanks. 
[23:36:32] (03PS3) 10Andrew Bogott: Use half-baked ldap auth for librenms [puppet] - 10https://gerrit.wikimedia.org/r/282830 (https://phabricator.wikimedia.org/T107702) [23:37:09] 07Blocked-on-Operations, 10Datasets-Archiving, 10Dumps-Generation, 10Flow, 03Collab-Team-2016-Q4: Publish recurring Flow dumps at http://dumps.wikimedia.org/ - https://phabricator.wikimedia.org/T119511#2197574 (10jmatazzoni) [23:37:40] Krenair: I see https://gerrit.wikimedia.org/r/#/c/282827/ is for wmf/1.27.0-wmf.20, so first I need to cherry-pick it to wmf21? [23:37:58] Dereckson: wmf.21 gets cut tomorrow morning. [23:38:15] so it's for wmf.20 and not wmf.21 you want a deploy? [23:38:17] PROBLEM - Disk space on elastic1001 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 79968 MB (15% inode=99%) [23:38:21] nope, I just wrote down the wrong number [23:38:25] it's for wmf.20 [23:38:28] k [23:38:33] sigh [23:38:35] good catch [23:38:52] only wmf.20 is actually deployed at the moment (you can check with `mwversionsinuse`) [23:39:00] 06Operations, 10Flow, 10MediaWiki-Redirects, 03Collab-Team-2016-Q4, and 2 others: Flow notification links on mobile point to desktop - https://phabricator.wikimedia.org/T107108#2197600 (10jmatazzoni) [23:42:34] 06Operations, 10Ops-Access-Requests: Grant reedy access to librenms - https://phabricator.wikimedia.org/T131252#2161064 (10Andrew) I've written a new patch that implements ldap auth. Once that merges, all Ops will have access. Of course, Reedy is not an Op. Is there any naturally pre-existing ldap group tha... [23:46:34] 06Operations, 10Ops-Access-Requests: Grant reedy access to librenms - https://phabricator.wikimedia.org/T131252#2197633 (10Krenair) There is a list at https://wikitech.wikimedia.org/wiki/LDAP_Groups Other than wmf/nda (which it sounds like you're not prepared to trust with this) I don't see an appropriate one.... 
[23:53:19] (03CR) 10Jforrester: [C: 04-2] "Actually for now we're going to remove all the code except importScript etc., and then do this later." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277823 (owner: 10Jforrester) [23:54:39] James_F: for your change, a sync-file on modules/ve-mw/init/ve.init.mw.ArticleTargetLoader.js will be enough? [23:54:58] Dereckson: Ask Krenair, he wrote it. [23:56:56] and put my name on the calendar for it :p [23:57:03] yep, that should be fine Dereckson [23:57:06] k