[00:00:05] <jouncebot>	 twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170330T0000).
[00:03:45] <icinga-wm>	 PROBLEM - Disk space on ruthenium is CRITICAL: DISK CRITICAL - free space: / 1775 MB (3% inode=90%)
[00:04:53] <mutante>	 !log ruthenium - apt-get clean gets a little more disk space
[00:04:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:05] <icinga-wm>	 RECOVERY - puppet last run on mw1277 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[00:06:14] <twentyafterfour>	 !log updating phabricator on iridium
[00:06:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:42] <twentyafterfour>	 y
[00:06:58] <mutante>	 subbu: hi, ideas on things we could delete on ruthenium? it seems low on disk
[00:07:12] <mutante>	 32G visualdiff
[00:07:59] <mutante>	 yea, it's almost all /srv/visualdiff/pngs
[00:08:19] <mutante>	 but since that is on / we need to rotate something 
[00:10:11] <mutante>	 !log ruthenium low on disk space, because /srv/visualdiff/pngs (parsoid-vd-tests) is pretty large and /srv isn't a separate mount
[00:10:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:35] <twentyafterfour>	 !log Phabricator update completed.
[00:10:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:52] <mutante>	 twentyafterfour: :) smooth update it seems. anything interesting that changed in phab?
[00:18:24] <twentyafterfour>	 mutante: search results should be a bit better
[00:18:36] <twentyafterfour>	 I fixed stemming and exact phrase matches
[00:19:09] <twentyafterfour>	 so it'll match agreements when you search for agreement
[00:21:43] <twentyafterfour>	 also git repository polling should be more efficient
[00:25:48] <mutante>	 twentyafterfour: cool :) thank you
[00:43:35] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 569.00 seconds
[00:49:35] <icinga-wm>	 PROBLEM - puppet last run on db1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:51:45] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 81459.999543 Seconds
[00:52:35] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 81511.605569 Seconds
[00:52:35] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 81511.61029 Seconds
[00:54:35] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds
[00:54:35] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds
[00:54:45] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds
[00:56:44] <wikibugs__>	 (CR) Dzahn: [C: 1] "here is one more for (after) Friday" [puppet] - https://gerrit.wikimedia.org/r/337207 (https://phabricator.wikimedia.org/T143349) (owner: Dzahn)
[01:01:21] <wikibugs>	 (PS1) 20after4: Phabricator: Fix up elasticsearch cluster config [puppet] - https://gerrit.wikimedia.org/r/345488
[01:01:22] <twentyafterfour>	 mutante: still around?
[01:01:44] <wikibugs__>	 (CR) Dzahn: [V: 1 C: 1] "wikimedia.bytemark.co.uk has address 212.110.173.211" [puppet] - https://gerrit.wikimedia.org/r/345323 (https://phabricator.wikimedia.org/T159331) (owner: Reedy)
[01:01:48] <mutante>	 twentyafterfour: yes
[01:02:09] <twentyafterfour>	 mind merging a config change for phabricator?
[01:02:12] <twentyafterfour>	 https://gerrit.wikimedia.org/r/#/c/345488/
[01:03:15] <wikibugs>	 (CR) 20after4: "This also removes mysql from the config since it's been disabled for a while, no need to keep it defined." [puppet] - https://gerrit.wikimedia.org/r/345488 (owner: 20after4)
[01:03:57] <mutante>	 ok, one moment
[01:04:09] <wikibugs__>	 (CR) Dzahn: [C: 1] "https://wikimedia.bytemark.co.uk/" [puppet] - https://gerrit.wikimedia.org/r/345325 (https://phabricator.wikimedia.org/T159331) (owner: Reedy)
[01:06:32] <mutante>	 so in each data center, read is false but write is true, for itself
[01:07:05] <mutante>	 eh, for the other one.. 
[01:07:18] <twentyafterfour>	 no it should be "write: true" for all,  and read: true for same data center
[01:07:31] <twentyafterfour>	 so it'll keep all indexes updated and read from the closest one
[01:08:37] <twentyafterfour>	 the main thing is that it's got two clusters with one host instead of one cluster with two hosts
[01:09:08] <twentyafterfour>	 the code treats clusters as separate and hosts as interchangeable for writes....so the old config wouldn't keep both data centers in sync properly
[01:10:00] <mutante>	 ok, i get it now, just had to stare at the 3 hiera files again
[01:10:14] <mutante>	 aha @ clusters
[01:10:38] <wikibugs__>	 (CR) Dzahn: [C: 2] Phabricator: Fix up elasticsearch cluster config [puppet] - https://gerrit.wikimedia.org/r/345488 (owner: 20after4)
[01:10:42] <twentyafterfour>	 the old config is correct if eqiad and codfw were a single es cluster but they aren't
[01:10:55] <mutante>	 ok, yes
[01:10:58] <twentyafterfour>	 thanks!
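Editor's sketch of the routing rule twentyafterfour describes at 01:07-01:09: every cluster receives writes, but only the cluster in the local data center is read from. The dictionary layout, the host names, and the helper names are hypothetical illustrations, not Phabricator's actual config schema and not the contents of the hiera files.

```python
# Hypothetical model of the intended setup: two single-host clusters,
# one per data center. Names and structure are illustrative assumptions.
CLUSTERS = [
    {"datacenter": "eqiad", "hosts": ["search.eqiad.example"]},
    {"datacenter": "codfw", "hosts": ["search.codfw.example"]},
]


def roles_for(cluster, local_dc):
    """Write to every cluster; read only from the one in the local data center."""
    return {"write": True, "read": cluster["datacenter"] == local_dc}


def read_targets(clusters, local_dc):
    """Data centers whose cluster is read from, as seen from local_dc."""
    return [c["datacenter"] for c in clusters if roles_for(c, local_dc)["read"]]


def write_targets(clusters, local_dc):
    """Data centers whose cluster is written to, as seen from local_dc."""
    return [c["datacenter"] for c in clusters if roles_for(c, local_dc)["write"]]
```

Seen from eqiad, this model reads only from eqiad while still writing to both eqiad and codfw, which is what keeps both indexes updated.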
[01:11:22] <mutante>	 merged on master, i'll let you run puppet on server
[01:11:26] <twentyafterfour>	 ok
[01:11:37] <twentyafterfour>	 !log running `puppet agent --test` on iridium
[01:11:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:17:09] <mutante>	 twentyafterfour: i assume you are in the middle of switching it. so the current exception on search form isn't unexpected
[01:17:35] <icinga-wm>	 RECOVERY - puppet last run on db1009 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[01:22:30] <xaosflux>	 hi channel - should I create a phab ticket for phab search being broken?
[01:22:47] <xaosflux>	 Unhandled Exception ("PhutilAggregateException") All of the configured Fulltext Search services failed.     - PhutilAggregateException: All Fulltext Search hosts failed:     - PhutilAggregateException: All Fulltext Search hosts failed:
[01:24:35] <xaosflux>	 guess everyone is lurking
[01:24:35] <xaosflux>	 https://phabricator.wikimedia.org/T161772
[01:27:59] <twentyafterfour>	 mutante: it's behaving incorrectly
[01:28:06] <twentyafterfour>	 trying to fix it
[01:28:17] <mutante>	 twentyafterfour: ok!
[01:28:47] <mutante>	 user created a ticket, i'm making food but i'll check here
[01:29:11] <twentyafterfour>	 thanks!
[01:29:45] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 83649.020591 Seconds
[01:29:45] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 83739.956605 Seconds
[01:29:45] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 83653.760098 Seconds
[01:30:35] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 83791.462606 Seconds
[01:30:35] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 83791.476055 Seconds
[01:31:55] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 83779.205899 Seconds
[01:33:35] <wikibugs>	 (PS1) 20after4: Turn on read: true to both clusters for now [puppet] - https://gerrit.wikimedia.org/r/345491
[01:33:45] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[01:35:18] <wikibugs__>	 (PS2) 20after4: Turn on read: true to both clusters for now [puppet] - https://gerrit.wikimedia.org/r/345491
[01:36:45] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 84073.680358 Seconds
[01:36:58] <wikibugs__>	 (PS3) 20after4: Turn on read: true to both clusters for now [puppet] - https://gerrit.wikimedia.org/r/345491
[01:37:45] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds
[01:38:19] <wikibugs>	 (PS4) 20after4: Turn on read: true to both clusters for now [puppet] - https://gerrit.wikimedia.org/r/345491 (https://phabricator.wikimedia.org/T161772)
[01:40:21] <twentyafterfour>	 mutante: ^ when you get a chance
[01:40:45] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 84399.866128 Seconds
[01:41:38] <wikibugs>	 (CR) Dzahn: [C: 2] Turn on read: true to both clusters for now [puppet] - https://gerrit.wikimedia.org/r/345491 (https://phabricator.wikimedia.org/T161772) (owner: 20after4)
[01:41:59] <mutante>	 twentyafterfour: go ahead with puppet
[01:42:23] <twentyafterfour>	 thanks
[01:43:50] <twentyafterfour>	 mutante: works! :)
[01:43:58] <mutante>	 search works for me
[01:44:03] <mutante>	 was about to say :) ok, cool
[01:44:24] <twentyafterfour>	 now I just need to kill that bogus exception but it'll be good like it is for now
[01:44:35] <icinga-wm>	 PROBLEM - PHD should be supervising processes on iridium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 997 (phd)
[01:44:37] <mutante>	 ok, great
[01:44:42] <mutante>	 eh, except that i guess
[01:44:43] <twentyafterfour>	 uhm..
[01:46:55] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[01:47:02] <twentyafterfour>	 weird,
[01:47:06] <twentyafterfour>	 it says it's running but it's not
[01:47:21] <twentyafterfour>	 " sudo service phd status
[01:47:23] <twentyafterfour>	 phd start/running
[01:47:25] <twentyafterfour>	 "
[01:48:11] <mutante>	 stop it fully and start it again, vs restart?
[01:48:23] <mutante>	 did it get restarted by config change?
[01:48:25] <twentyafterfour>	 tried that
[01:48:45] <twentyafterfour>	 it got restarted by me manually restarting it (just to be sure it picked up everything)
[01:48:45] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[01:48:54] * twentyafterfour should have logged that
[01:49:28] <mutante>	 is manual restart the same that puppet does ?
[01:49:36] <twentyafterfour>	 should be
[01:49:38] <mutante>	 maybe a pid file owned by wrong user now 
[01:49:43] <mutante>	 looks
[01:49:54] <twentyafterfour>	 Daemon 586105 STDE [Thu, 30 Mar 2017 01:42:24 +0000] <SGNL> Caught signal 15 (SIGTERM).
[01:49:55] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 84859.322478 Seconds
[01:49:56] <twentyafterfour>	 Daemon 586105 FAIL [Thu, 30 Mar 2017 01:42:24 +0000] Process exited with error 143.
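Editor's note on the two daemon log lines above: "error 143" is consistent with the SIGTERM reported just before it, since a process terminated by signal N is conventionally reported with exit status 128 + N, and SIGTERM is signal 15. A small sketch of that arithmetic follows; the helper name is made up for illustration and is not part of phd.

```python
import signal


def signal_from_exit_status(status):
    """Return the signal implied by a 128+N exit status, or None."""
    if status > 128:
        try:
            return signal.Signals(status - 128)
        except ValueError:
            # Not a signal number known on this platform.
            return None
    return None
```

Calling `signal_from_exit_status(143)` maps back to `signal.SIGTERM`, matching the "Caught signal 15" line.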
[01:50:06] <mutante>	 sigh
[01:50:20] <twentyafterfour>	 oh wait I see something ...
[01:50:47] <mutante>	 should have seen that happen.. ariel's law
[01:52:27] <twentyafterfour>	 !log phd fixed on iridium. libphutil was out of sync with phd source
[01:52:30] <mutante>	 finds random pastes by others like https://secure.phabricator.com/P1729
[01:52:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:52:35] <icinga-wm>	 RECOVERY - PHD should be supervising processes on iridium is OK: PROCS OK: 21 processes with UID = 997 (phd)
[01:52:41] <mutante>	 heh
[01:52:44] <mutante>	 :) 
[01:53:08] <twentyafterfour>	 thanks mutante, sorry for the alerts, that one was my fault
[01:54:15] <icinga-wm>	 PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:54:32] <mutante>	 no problem. looks like we are good then. i'll step out then. cu later twentyafterfour 
[01:54:45] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 85155.223919 Seconds
[01:54:59] <mutante>	 if you need an emergency merge, SMS to number in office wiki list
[01:55:07] <twentyafterfour>	 thanks! have a good evening
[01:55:14] <mutante>	 thanks, bye
[01:56:35] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 37.877426 Seconds
[01:56:35] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 37.875977 Seconds
[01:56:45] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 49.074035 Seconds
[01:56:45] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 50.325248 Seconds
[01:56:45] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 52.234984 Seconds
[01:56:56] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 56.221606 Seconds
[01:57:15] <icinga-wm>	 RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[02:22:59] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.17) (duration: 07m 20s)
[02:23:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:36:05] <icinga-wm>	 PROBLEM - puppet last run on es1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:57:42] <logmsgbot>	 !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.18) (duration: 13m 43s)
[02:57:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:59:22] <c>	 it doesn't say , master anymore :(
[03:03:05] <icinga-wm>	 RECOVERY - puppet last run on es1019 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[03:03:31] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Mar 30 03:03:31 UTC 2017 (duration 5m 49s)
[03:03:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:23:45] <icinga-wm>	 RECOVERY - Disk space on ruthenium is OK: DISK OK
[03:25:19] <subbu>	 mutante, parsoid logs in /srv/log/parsoid tend to grow big. i deleted them for now ... but, it'll start filling up again on the next test run. let us chat tomorrow about limiting log sizes there.
[04:22:24] <wikibugs__>	 (Abandoned) Subramanya Sastry: Allow parsoid-vd-client service to be controlled outside systemd [puppet] - https://gerrit.wikimedia.org/r/344961 (owner: Subramanya Sastry)
[04:38:34] <wikibugs__>	 (PS2) Felipe L. Ewald: Add Bytemark to public_mirrors.html list [puppet] - https://gerrit.wikimedia.org/r/345325 (https://phabricator.wikimedia.org/T159331) (owner: Reedy)
[04:51:25] <icinga-wm>	 PROBLEM - puppet last run on db1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:00:55] <icinga-wm>	 PROBLEM - puppet last run on analytics1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:17:05] <icinga-wm>	 PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 29 probes of 426 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map
[05:19:25] <icinga-wm>	 RECOVERY - puppet last run on db1024 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[05:29:55] <icinga-wm>	 RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[05:32:05] <icinga-wm>	 RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 15 probes of 426 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map
[05:43:38] <wikibugs__>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/345495
[05:43:42] <wikibugs>	 (PS2) Marostegui: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/345495
[05:47:37] <wikibugs__>	 Operations, Monitoring: Add slabinfo prometheus exporter - https://phabricator.wikimedia.org/T160071#3142788 (ema)
[05:51:25] <ema>	 !log upgrading twisted to 16.2.0 on lvs100[456] (eqiad secondaries) T160433
[05:51:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:51:32] <stashbot>	 T160433: Upgrade twisted on load balancers to 16.2.0 - https://phabricator.wikimedia.org/T160433
[05:53:36] <wikibugs__>	 (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/345495 (owner: Marostegui)
[05:54:50] <wikibugs>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/345495 (owner: Marostegui)
[05:55:01] <wikibugs__>	 (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - https://gerrit.wikimedia.org/r/345495 (owner: Marostegui)
[05:56:07] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1094 - T17441 (duration: 00m 45s)
[05:56:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:13] <stashbot>	 T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441
[05:56:43] <marostegui>	 !log Deploy schema change on db2014 - codfw master (this will generate lag on codfw) - T73563
[05:56:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:49] <stashbot>	 T73563: *_minor_mime are varbinary(32) on WMF sites, out of sync with varbinary(100) in MW core - https://phabricator.wikimedia.org/T73563
[06:04:50] <wikibugs__>	 Operations, ops-eqiad, DBA, Phabricator: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3142794 (Marostegui) Resolved>Open This has happened again, so maybe the BBU is indeed faulty. ``` root@db1048:~# date ; mysql --skip-ssl -e "show slave status\G" | grep...
[06:06:45] <wikibugs>	 Operations, ops-eqiad, DBA, Phabricator: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3142797 (Marostegui) a:Marostegui>Cmjohnson
[06:10:12] <wikibugs>	 Operations, Pybal, Traffic, Patch-For-Review: Pybal IdleConnectionMonitor with TCP KeepAlive shows random fails if more than 100 servers are involved. - https://phabricator.wikimedia.org/T119372#3142801 (ema)
[06:11:05] <wikibugs__>	 Operations, Pybal, Traffic: Make PyBal respect advertised BGP capabilities - https://phabricator.wikimedia.org/T81305#3142806 (ema)
[06:11:27] <wikibugs>	 Operations, Pybal, Traffic: Add pybal check to ensure service IP is bound - https://phabricator.wikimedia.org/T79730#3142807 (ema)
[06:25:41] <marostegui>	 !log Logging backwards for the record: restart mysql on db1047 for maintenance - T160454
[06:25:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:25:49] <stashbot>	 T160454: Change length of userAgent column on EL tables  - https://phabricator.wikimedia.org/T160454
[06:27:35] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[06:28:59] <wikibugs__>	 Operations, ops-eqiad, DBA, Phabricator: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3142822 (Marostegui) ``` ˜/icinga-wm 8:27> RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.00 seconds ```  ``` Battery State: Optimal ```  `...
[06:32:15] <wikibugs>	 (PS1) Muehlenhoff: Remove a number of obsolete conditionals from mediawiki classes [puppet] - https://gerrit.wikimedia.org/r/345502
[06:45:01] <moritzm>	 !log installing apparmor security updates on trusty
[06:45:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:25] <icinga-wm>	 PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[percona-toolkit]
[06:58:00] <wikibugs>	 (CR) Elukey: [C: 1] Remove a number of obsolete conditionals from mediawiki classes [puppet] - https://gerrit.wikimedia.org/r/345502 (owner: Muehlenhoff)
[07:00:45] <icinga-wm>	 PROBLEM - puppet last run on mw1168 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[apparmor]
[07:03:10] <wikibugs>	 (PS1) Muehlenhoff: Remove obsolete mediawiki::packages::legacy class [puppet] - https://gerrit.wikimedia.org/r/345505
[07:05:13] <ema>	 !log upgrading twisted to 16.2.0 on lvs100[123] (eqiad primaries) T160433
[07:05:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:20] <stashbot>	 T160433: Upgrade twisted on load balancers to 16.2.0 - https://phabricator.wikimedia.org/T160433
[07:10:45] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[07:12:20] <ema>	 brief 5xx spike in esams text ^
[07:12:35] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[07:18:05] <icinga-wm>	 PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0]
[07:18:35] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:19:45] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:24:47] <wikibugs>	 Operations, Pybal, Traffic: Unhandled pybal ValueError: need more than 1 value to unpack - https://phabricator.wikimedia.org/T143078#3142889 (ema) This problem should be [[https://github.com/twisted/twisted/commit/942b63cc04fba83dabf1958b3ed24af860778681|solved upstream]]. I've just finished upgradin...
[07:25:25] <icinga-wm>	 RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[07:28:45] <icinga-wm>	 RECOVERY - puppet last run on mw1168 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[07:29:59] <wikibugs__>	 (PS1) Muehlenhoff: Add some more fine-grained debdeploy server groups for openstack [puppet] - https://gerrit.wikimedia.org/r/345506
[07:31:25] <wikibugs>	 (CR) Muehlenhoff: [C: 2] Add some more fine-grained debdeploy server groups for openstack [puppet] - https://gerrit.wikimedia.org/r/345506 (owner: Muehlenhoff)
[07:39:57] <wikibugs>	 Operations, ops-codfw, DC-Ops, Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3142970 (Gehel) After multiple tests, generating CPU, memory and IO load on elastic2021, the server has not crashed. Those tests are the same...
[07:41:40] <gehel>	 !log pull elastic2021 back into active duty - T149006
[07:41:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:47] <stashbot>	 T149006: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006
[07:43:57] <logmsgbot>	 !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic2021.codfw.wmnet
[07:44:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:20] <wikibugs__>	 (CR) Elukey: [C: 1] Remove obsolete mediawiki::packages::legacy class [puppet] - https://gerrit.wikimedia.org/r/345505 (owner: Muehlenhoff)
[07:48:57] <wikibugs>	 Operations, Pybal, Traffic: Upgrade twisted on load balancers to 16.2.0 - https://phabricator.wikimedia.org/T160433#3142982 (ema) Open>Resolved
[07:57:33] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1090 [mediawiki-config] - https://gerrit.wikimedia.org/r/345507 (https://phabricator.wikimedia.org/T17441)
[07:57:46] <wikibugs>	 (PS2) Elukey: Prepare analytics1027 for decommission [puppet] - https://gerrit.wikimedia.org/r/345117 (https://phabricator.wikimedia.org/T161597)
[07:58:37] <wikibugs>	 (CR) Elukey: "Thanks @BBlack, should be ok now!" [puppet] - https://gerrit.wikimedia.org/r/345117 (https://phabricator.wikimedia.org/T161597) (owner: Elukey)
[08:00:53] <wikibugs__>	 Operations, Commons, Traffic, Wikimedia-Site-requests, and 2 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517#3142987 (Nemo_bis) > I don't think we've been aware of the uselang hack or its mechanics before  The documentatio...
[08:01:25] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1090 [mediawiki-config] - https://gerrit.wikimedia.org/r/345507 (https://phabricator.wikimedia.org/T17441) (owner: Marostegui)
[08:02:54] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1090 [mediawiki-config] - https://gerrit.wikimedia.org/r/345507 (https://phabricator.wikimedia.org/T17441) (owner: Marostegui)
[08:03:11] <wikibugs>	 (CR) jenkins-bot: db-eqiad.php: Depool db1090 [mediawiki-config] - https://gerrit.wikimedia.org/r/345507 (https://phabricator.wikimedia.org/T17441) (owner: Marostegui)
[08:03:50] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1090 - T17441 (duration: 00m 44s)
[08:03:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:56] <stashbot>	 T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441
[08:13:21] <marostegui>	 !log Convert UNIQUE keys to PK on db1090 (s2) - T17441
[08:13:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:27] <stashbot>	 T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441
[08:18:05] <icinga-wm>	 PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0]
[08:19:01] <wikibugs>	 (PS3) Giuseppe Lavagetto: Use internal url for Ores, move to ProductionServices.php [mediawiki-config] - https://gerrit.wikimedia.org/r/316317
[08:19:03] <wikibugs__>	 (PS1) Giuseppe Lavagetto: Convert parsoid, mathoid, restbase urls to use the discovery hostname [mediawiki-config] - https://gerrit.wikimedia.org/r/345508
[08:19:05] <wikibugs>	 (PS1) Giuseppe Lavagetto: Convert cxserver, eventbus to use the discovery url [mediawiki-config] - https://gerrit.wikimedia.org/r/345509
[08:19:07] <wikibugs__>	 (PS1) Giuseppe Lavagetto: Use discovery url for Ores as well [mediawiki-config] - https://gerrit.wikimedia.org/r/345510
[08:23:12] <wikibugs__>	 (CR) Giuseppe Lavagetto: [C: -1] "I don't think having case-sensitive regexes when we have case-insensitive equality is the way to go:" [software/cumin] - https://gerrit.wikimedia.org/r/345402 (https://phabricator.wikimedia.org/T161730) (owner: Volans)
[08:25:05] <icinga-wm>	 RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0]
[08:34:38] <elukey>	 hashar: o/ - I saw your comment in https://github.com/phpredis/phpredis/issues/562 and I wanted to ask what would it take to upgrade phpredis
[08:34:55] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.073 second response time
[08:34:56] <icinga-wm>	 RECOVERY - HHVM rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 73873 bytes in 0.289 second response time
[08:35:08] <hashar>	 elukey: hello
[08:35:15] <icinga-wm>	 RECOVERY - Apache HTTP on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.028 second response time
[08:35:39] <hashar>	 elukey: so I think you/someone or a task has led me to that issue
[08:35:50] <hashar>	 and i merely explicitly listed the versions being used
[08:35:53] <elukey>	 yes the redis timeouts :)
[08:36:30] <hashar>	 in Zend world that would be a php5-redis.deb package or something that we would need to bump
[08:36:46] <hashar>	 in HHVM I have no clue. Maybe it uses the Zend extension
[08:37:46] <elukey>	 yes yes I was reading https://www.mediawiki.org/wiki/Redis
[08:38:24] <moritzm>	 !log repooling mw1261 to reproduce hhvm deadlock with higher debug level
[08:38:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:04] <elukey>	 thanks hashar :)
[08:40:22] <hashar>	 elukey: well I am not sure how helpful I am here :(
[08:41:15] <hashar>	 hhvm apparently implements redis in plain PHP via hphp/system/php/redis/*.php
[08:42:08] <elukey>	 I am checking it, and if we have the same quit issue in there too
[08:43:11] <hashar>	 depends on whether the jobrunner service daemon/loop uses zend or hhvm I guess
[08:44:19] <hashar>	 /usr/bin/php /srv/deployment/jobrunner/jobrunner/redisJobRunnerService --config-file=/etc/jobrunner/jobrunner.conf 
[08:44:24] <hashar>	 $ readlink -f /usr/bin/php
[08:44:24] <hashar>	 /usr/bin/hhvm
[08:44:44] <hashar>	 elukey: so most probably the github issue is for the Zend extension, and we would have to look at hhvm php code
[08:44:59] <elukey>	 https://gerrit.wikimedia.org/r/#/c/85003/1/includes/clientpool/RedisConnectionPool.php :)
[08:45:59] <wikibugs>	 Operations, ops-eqiad, DBA, Phabricator: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3143079 (jcrespo) I believe there was yesterday maintenance or trouble on Phabricator. I would ask RelEng first.
[08:46:44] <hashar>	 ahh I miss ori :(
[08:47:48] <wikibugs>	 Operations, ops-eqiad, DBA, Phabricator: db1048 BBU Faulty - slave lagging - https://phabricator.wikimedia.org/T160731#3143084 (Marostegui) Yep, the deployment page said there was a phabricator update so maybe that put more stress on the server and made the BBU fail (again)?  Because the fact tha...
[08:48:11] <elukey>	 we all miss him :)
[08:49:05] <icinga-wm>	 PROBLEM - puppet last run on labvirt1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:52:03] <wikibugs__>	 (CR) Jcrespo: [C: 1] "Let's go for this and let's revisit next week." [puppet] - https://gerrit.wikimedia.org/r/345372 (owner: Marostegui)
[08:52:56] <hashar>	 elukey: then hhvm opens the connection with  $conn = fsockopen($host, $port, $errno, $errstr, $timeout);   and closes it with fclose($conn);
[08:53:07] <hashar>	 no idea how that is relevant though :(
[08:55:14] <_joe_>	 elukey: still trying to resolve the "cannot connect" redis problem as it's client-side?
[08:59:14] <wikibugs>	 (CR) Volans: "> I don't think having case-sensitive regexes when we have" [software/cumin] - https://gerrit.wikimedia.org/r/345402 (https://phabricator.wikimedia.org/T161730) (owner: Volans)
[09:01:54] <wikibugs>	 (PS3) Volans: Uniform maintenance message and indentation [mediawiki-config] - https://gerrit.wikimedia.org/r/343859 (https://phabricator.wikimedia.org/T160178)
[09:03:33] <wikibugs__>	 (CR) Marostegui: [C: 1] Uniform maintenance message and indentation [mediawiki-config] - https://gerrit.wikimedia.org/r/343859 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[09:04:05] <icinga-wm>	 PROBLEM - HHVM rendering on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:04:05] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:04:25] <icinga-wm>	 PROBLEM - Apache HTTP on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:04:31] <volans>	 elukey: not me :)
[09:04:36] <elukey>	 ahahha
[09:04:46] <elukey>	 this one is the deadlock that Moritz is investigating
[09:04:58] <elukey>	 depooling mw1261
[09:05:45] <elukey>	 !log depooling mw1261 (hhvm-dump-debug in /tmp/hhvm.98736.bt.)
[09:05:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:15] <logmsgbot>	 !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1261.eqiad.wmnet
[09:06:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:30] <elukey>	 forgot the --quiet :P
[09:06:34] <elukey>	 moritzm: --^
[09:07:06] <elukey>	 _joe_ I am really annoyed by all that spam related to timeouts in logstash :)
[09:08:01] <elukey>	 hashar: I thought https://github.com/wikimedia/mediawiki/blob/wmf/1.29.0-wmf.18/includes/libs/redis/RedisConnectionPool.php#L393 was the closing part..
[09:08:23] <_joe_>	 elukey: the problem is the shitton of lua code that we execute repeatedly and that blocks redis
[09:08:32] <_joe_>	 that's the reason of the connection errors
[09:09:01] <_joe_>	 we can either 1) accept to have a larger timeout at least from jobrunners 2) retry 3) add moar instances
[09:09:02] <elukey>	 _joe_ okok I know but those TCP RSTs are due to the QUIT commands
[09:09:16] <elukey>	 I wanted to open a hhvm issue on gh
[09:09:37] <elukey>	 and I am +1 for a larger jobrunner timeout for redis
[09:09:41] <elukey>	 but I have no clue how to do it :)
[09:09:54] <_joe_>	 why is a TCP RST a bad thing when you close a connection? or do you think that's creating issues in the connectionpool?
[09:12:07] <elukey>	 well afaics the client (hhvm on jobrunners) send a QUIT to redis and then closes the socket, then redis reply with "OK" and gets a RST
[09:12:16] <elukey>	 not really a big deal ok but definitely not clean
[09:12:26] <elukey>	 phpredis fixed the issue avoiding the QUIT
[09:12:34] <wikibugs__>	 (CR) Jcrespo: [C: 1] Uniform maintenance message and indentation [mediawiki-config] - https://gerrit.wikimedia.org/r/343859 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[09:13:58] <moritzm>	 elukey: thanks, that was much quicker this time, having a look
[09:15:42] <elukey>	 hashar: opened https://github.com/facebook/hhvm/issues/7757 :)
[09:17:05] <icinga-wm>	 RECOVERY - puppet last run on labvirt1009 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[09:18:45] <icinga-wm>	 RECOVERY - Check systemd state on netmon1001 is OK: OK - running: The system is fully operational
[09:22:45] <icinga-wm>	 PROBLEM - puppet last run on dbproxy1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:22:55] <icinga-wm>	 PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:25:30] <wikibugs__>	 Operations, User-Elukey, Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3143129 (elukey) Opened https://github.com/facebook/hhvm/issues/7757 for the TCP RSTs.
[09:25:35] <icinga-wm>	 PROBLEM - Check systemd state on mw1261 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:25:55] <icinga-wm>	 PROBLEM - HHVM processes on mw1261 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm
[09:26:26] <moritzm>	 ^set downtime, debugging things
[09:27:23] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] "OK, but couple it with the change in line 69 about ruby 1.8. From what I see in If31535d7092d4bb9167fe838253d17a47542349e they were anyway" [puppet] - https://gerrit.wikimedia.org/r/345366 (owner: Dzahn)
[09:27:43] <wikibugs__>	 (CR) Volans: [C: 2] Uniform maintenance message and indentation [mediawiki-config] - https://gerrit.wikimedia.org/r/343859 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[09:28:51] <wikibugs__>	 (Merged) jenkins-bot: Uniform maintenance message and indentation [mediawiki-config] - https://gerrit.wikimedia.org/r/343859 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[09:29:00] <wikibugs__>	 (CR) jenkins-bot: Uniform maintenance message and indentation [mediawiki-config] - https://gerrit.wikimedia.org/r/343859 (https://phabricator.wikimedia.org/T160178) (owner: Volans)
[09:32:35] <icinga-wm>	 PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:34:23] <logmsgbot>	 !log root@tin Synchronized wmf-config/db-codfw.php: Uniform maintenance message and indentation (duration: 00m 44s)
[09:34:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:28] <logmsgbot>	 !log root@tin Synchronized wmf-config/db-eqiad.php: Uniform maintenance message and indentation (duration: 00m 47s)
[09:48:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:37] <volans>	 jynus: AFAICT the root@ is just scap failing to detect the proper user, but just to be sure do you know a quick way to check that everything is fine on tin?
[09:50:00] <jynus>	 volans, did you merge something as root?
[09:50:04] <volans>	 no
[09:50:10] <volans>	 and neither run it as root
[09:50:29] <volans>	 su - volans -c scap sync-file...
[09:50:32] <jynus>	 what issue are you talking about?
[09:50:42] <volans>	 "log root@tin Synch...."
[09:50:55] <jynus>	 ah, uid issues
[09:50:57] <volans>	 the log message from scap says root@, because using os.getlogin() it failed to detect the proper user
[09:51:20] <volans>	 but, just to be on the safe side, I would like to ensure that scap was not run as root
[09:51:45] <icinga-wm>	 RECOVERY - puppet last run on dbproxy1006 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[09:51:52] <volans>	 like if scap do something based on os.getlogin() it might have used the wrong user for something
[09:51:55] <icinga-wm>	 RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[09:52:01] <jynus>	 check the git files
[09:52:11] <jynus>	 you can technically scap as root
[09:52:14] <volans>	 those are ok, also because I merged it manually as my user
[09:52:27] <jynus>	 then I do not see many issues
[09:52:38] <volans>	 only the scap is run through cumin that ssh as root and then do su - volans -c scap...
[09:52:49] <jynus>	 but talk to scap devels, maybe they can fix the user detection
[09:53:02] <jynus>	 or if they see some issue with that
[09:53:06] <jynus>	 I would assume no
[09:53:15] <volans>	 ok, thanks
[09:53:39] <jynus>	 in the past, the only complaint was not being able to merge
[09:53:48] <jynus>	 but that should already have an icinga check
[09:54:12] <volans>	 ok, thanks, also getlogin is only used for the logging in the code, so we should be ok
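
The "root@tin" log entries above follow from how `os.getlogin()` works: it reports the user of the login session (here the root ssh session opened by cumin), not the user the process runs as after `su - volans`. A minimal sketch of the difference is below. It is an illustration only and not scap's actual code; the helper names `effective_user` and `login_user` are hypothetical.

```python
import os
import pwd


def effective_user() -> str:
    """Name of the user the process is really running as (effective UID)."""
    uid = os.geteuid()
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # UID without a passwd entry (possible in some containers).
        return str(uid)


def login_user():
    """Name tied to the login session, or None when it cannot be determined."""
    try:
        return os.getlogin()
    except OSError:
        # No controlling terminal or login record (cron, some ssh setups).
        return None


if __name__ == "__main__":
    # Under "ssh root@host" followed by "su - someuser -c ...", these two
    # values can differ: the login user stays root, the effective user does not.
    print("effective user:", effective_user())
    print("login user:    ", login_user())
```

A tool that only uses the login name for log messages, as volans concludes about scap here, would mislabel the entry without running anything as the wrong user.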
[10:01:35] <icinga-wm>	 RECOVERY - Check systemd state on mw1261 is OK: OK - running: The system is fully operational
[10:01:55] <icinga-wm>	 RECOVERY - HHVM processes on mw1261 is OK: PROCS OK: 6 processes with command name hhvm
[10:01:56] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.393 second response time
[10:02:06] <icinga-wm>	 RECOVERY - HHVM rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 74017 bytes in 3.717 second response time
[10:02:15] <icinga-wm>	 RECOVERY - Apache HTTP on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.048 second response time
[10:02:35] <icinga-wm>	 RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[10:03:19] <moritzm>	 !log repooling mw1261 for additional test
[10:03:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:22] <wikibugs__>	 (CR) StudiesWorld: [C: 1] "It seems appropriate given the bug is resolved and it disables it correctly." [puppet] - https://gerrit.wikimedia.org/r/345371 (https://phabricator.wikimedia.org/T111760) (owner: Dzahn)
[10:21:05] <wikibugs__>	 (CR) Alexandros Kosiaris: [C: 2] "This will probably require more iterations as we switch authz models (kubernetes is still actively developing them), but looks fine for no" [puppet] - https://gerrit.wikimedia.org/r/345187 (owner: Dduvall)
[10:21:10] <wikibugs__>	 (PS3) Alexandros Kosiaris: k8s: Accept any given api server authorization mode [puppet] - https://gerrit.wikimedia.org/r/345187 (owner: Dduvall)
[10:21:13] <wikibugs>	 (CR) Alexandros Kosiaris: [V: 2 C: 2] k8s: Accept any given api server authorization mode [puppet] - https://gerrit.wikimedia.org/r/345187 (owner: Dduvall)
[10:25:44] <wikibugs__>	 (CR) Hashar: "That is the same as https://gerrit.wikimedia.org/r/#/c/343309/ or am I confusing files somehow? :]" [puppet] - https://gerrit.wikimedia.org/r/345505 (owner: Muehlenhoff)
[10:27:17] <wikibugs>	 (CR) Elukey: "Hi Diego," [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/318068 (owner: R4q3NWnUx2CEhVyr)
[10:30:55] <wikibugs__>	 Operations, Ops-Access-Requests, Patch-For-Review: Granting wmde group access to grafana-admin.wikimedia.org - https://phabricator.wikimedia.org/T161484#3143260 (MoritzMuehlenhoff) Open>declined @Addshore: I'm sorry, but we have to decline that request: We've discussed this in the TechOps mee...
[10:47:05] <wikibugs__>	 Operations, ops-eqiad, Patch-For-Review, User-fgiunchedi: Rack and Setup ms-be1028-ms-1039 - https://phabricator.wikimedia.org/T160640#3143303 (fgiunchedi) 1031's port was also a member of labs-instances vlan, removed the port from there and disabled/enabled the port and now 1031 can pxe-boot.  1...
[10:49:10] <wikibugs>	 (Abandoned) Muehlenhoff: Remove obsolete mediawiki::packages::legacy class [puppet] - https://gerrit.wikimedia.org/r/345505 (owner: Muehlenhoff)
[10:51:54] <wikibugs__>	 (PS2) Muehlenhoff: mediawiki: remove Precise class packages::legacy [puppet] - https://gerrit.wikimedia.org/r/343309 (https://phabricator.wikimedia.org/T158652) (owner: Hashar)
[10:52:42] <hashar>	 elukey: so I have a terrible theory.  The HHVM redis code close the connection by sending 'QUIT' then immediately closing the socket
[10:53:11] <hashar>	 elukey: but the redis server might be sending back to client an acknowledgement eg: "OK"
[10:53:28] <hashar>	 but since the connection got closed, maybe that trigger a TCP RST
[10:54:09] <hashar>	 the thing I dont get is your wireshark capture shows the RST coming from the client
[10:54:32] <hashar>	 but that match the phpredis bug
[10:55:04] <hashar>	 so in the end most probably the HHVM implementation of Redis.close() should wait for an OK after QUIT
[10:55:12] <hashar>	 eg what phpredis has done
[10:56:30] <elukey>	 hashar: this is the exact theory that I have :)
[10:56:58] <hashar>	 yeah so if two of us + phpredis patch  have the same conspiracy theory
[10:57:00] <hashar>	 it must be true!
[10:57:22] <hashar>	 so maybe that close() https://github.com/facebook/hhvm/blob/master/hphp/system/php/redis/Redis.php#L86-L90
[10:57:33] <hashar>	 needs an extra function call to read from the server
[10:58:13] <wikibugs>	 (CR) Muehlenhoff: [C: 2] mediawiki: remove Precise class packages::legacy [puppet] - https://gerrit.wikimedia.org/r/343309 (https://phabricator.wikimedia.org/T158652) (owner: Hashar)
[10:58:31] <hashar>	 elukey: most probably add a call to processStringResponse()
[10:58:35] <hashar>	 should returns OK
[10:58:53] <hashar>	 but I havent read the redis protocol to figure out what payload the server replies with when a client send (str)QUIT
[10:59:13] <elukey>	 it should be a simple "OK"
[10:59:13] <hashar>	 going to lunch &
[10:59:18] <hashar>	 hopefully yeah
[10:59:20] <elukey>	 phpredis solved it not calling QUIT :)
[10:59:26] <hashar>	 should be easily testable
[10:59:45] <hashar>	 with a mock tcp server that handles QUIT and yields OK after x milliseconds
[11:00:04] <hashar>	 that should show the RST happening since the socket is closed before 'OK' is sent back
[11:00:20] <hashar>	 yeah or skip calling quit. But I am not sure how redis server will handle that
[11:00:21] <hashar>	 ;D
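
The test hashar proposes above (a mock TCP server that answers QUIT with a delayed OK) can be sketched as follows. This is a hedged illustration, not HHVM or phpredis code: `mock_redis_server`, `quit_client` and `run_once` are hypothetical names, and the script only shows the two client behaviours. Seeing the RST itself still needs a packet capture (for example tcpdump on the loopback interface) while the "close immediately" variant runs.

```python
import socket
import threading
import time


def mock_redis_server(listener, delay):
    """Accept one client, wait for QUIT, then answer +OK after `delay` seconds."""
    conn, _ = listener.accept()
    with conn:
        data = conn.recv(1024)
        if b"QUIT" in data.upper():
            time.sleep(delay)
            try:
                # If the client already closed its socket, this reply is what
                # would provoke a TCP RST from the client's side.
                conn.sendall(b"+OK\r\n")
            except OSError:
                pass


def quit_client(port, wait_for_reply):
    """Send QUIT; optionally read the server's reply before closing."""
    with socket.create_connection(("127.0.0.1", port), timeout=2) as sock:
        sock.sendall(b"QUIT\r\n")
        if not wait_for_reply:
            return None  # socket is closed right away, as in the suspected bug
        buf = b""
        while not buf.endswith(b"\r\n"):
            chunk = sock.recv(1024)
            if not chunk:
                break
            buf += chunk
        return buf


def run_once(wait_for_reply, delay=0.05):
    """Run one server/client exchange on an ephemeral loopback port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(target=mock_redis_server, args=(listener, delay))
    server.start()
    reply = quit_client(port, wait_for_reply)
    server.join()
    listener.close()
    return reply


if __name__ == "__main__":
    print("wait for reply:   ", run_once(True))
    print("close immediately:", run_once(False))
```

The `wait_for_reply=True` path corresponds to the fix discussed for `Redis.close()` (read the acknowledgement before closing); skipping QUIT altogether, as elukey says phpredis did, is the other option mentioned in the conversation.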
[11:00:28] <_joe_>	 paravoid: did I say I didn't have time to work on the jobrunners, did I?
[11:00:43] <_joe_>	 jouncebot: !next
[11:00:48] <_joe_>	 uhm
[11:00:50] <_joe_>	 !next
[11:00:53] <hasharAway>	 jouncebot: refresh
[11:00:56] <jouncebot>	 I refreshed my knowledge about deployments.
[11:00:57] <hasharAway>	 maybe it died
[11:00:58] <_joe_>	 !next
[11:01:00] <paravoid>	 huh?
[11:01:01] <hasharAway>	 jouncebot: next
[11:01:01] <jouncebot>	 In 1 hour(s) and 58 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170330T1300)
[11:01:09] <hasharAway>	 somehow jouncebot lost its state
[11:01:16] <_joe_>	 ok... I'm adding a few patches there :)
[11:01:39] <wikibugs__>	 (PS1) Giuseppe Lavagetto: role::mediawiki::common: extract scap::proxy to a profile [puppet] - https://gerrit.wikimedia.org/r/345527
[11:01:41] <wikibugs>	 (PS1) Giuseppe Lavagetto: role::mediawiki::common: convert to profile (step 1) [puppet] - https://gerrit.wikimedia.org/r/345528
[11:01:43] <wikibugs__>	 (PS1) Giuseppe Lavagetto: role::mediawiki::common: convert to profiles (step 2) [puppet] - https://gerrit.wikimedia.org/r/345529
[11:01:45] <wikibugs>	 (PS1) Giuseppe Lavagetto: profile::mediawiki::jobrunner: move role to profile, unify videoscalers [puppet] - https://gerrit.wikimedia.org/r/345530
[11:01:47] <wikibugs__>	 (PS1) Giuseppe Lavagetto: mediawiki: add mediawiki_active_dc function [puppet] - https://gerrit.wikimedia.org/r/345531
[11:01:48] <hasharAway>	 ;]
[11:04:36] <_joe_>	 elukey: would you care to review ^^ ?
[11:04:44] <_joe_>	 I know it's a lot :/
[11:05:20] <elukey>	 sure I'll do it!
[11:07:28] <wikibugs__>	 (CR) Elukey: [C: 1] "Tested and it works fine, thanks a lot! This is definitely a bug." [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/317825 (owner: R4q3NWnUx2CEhVyr)
[11:27:04] <wikibugs__>	 (PS2) ArielGlenn: Update Bytemark wikimedia mirror hostname [puppet] - https://gerrit.wikimedia.org/r/345323 (https://phabricator.wikimedia.org/T159331) (owner: Reedy)
[11:27:39] <wikibugs__>	 (CR) Elukey: [C: 1] "LGTM, from https://puppet-compiler.wmflabs.org/5970/mw1280.eqiad.wmnet/ (scap proxy) it seems fine" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/345527 (owner: Giuseppe Lavagetto)
[11:28:35] <icinga-wm>	 PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:31:31] <wikibugs>	 (CR) Elukey: [C: 1] "LGTM https://puppet-compiler.wmflabs.org/5971/mw1280.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/345528 (owner: Giuseppe Lavagetto)
[11:35:53] <icinga-wm>	 PROBLEM - puppet last run on ms-be1033 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdc1]
[11:36:37] <wikibugs__>	 (CR) ArielGlenn: [C: 2] Update Bytemark wikimedia mirror hostname [puppet] - https://gerrit.wikimedia.org/r/345323 (https://phabricator.wikimedia.org/T159331) (owner: Reedy)
[11:37:13] <icinga-wm>	 PROBLEM - puppet last run on ms-be1032 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 9 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sdc1]
[11:37:44] <volans>	 godog: new swift backend hosts? ^^^
[11:42:03] <Urbanecm>	 jouncebot, next
[11:42:03] <jouncebot>	 In 1 hour(s) and 17 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170330T1300)
[11:48:44] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Remove db1057 from s1 [mediawiki-config] - https://gerrit.wikimedia.org/r/345535 (https://phabricator.wikimedia.org/T160435)
[11:55:40] <moritzm>	 !log installing jbig2dec security updates
[11:55:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:43] <icinga-wm>	 RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[12:02:16] <moritzm>	 !log installing glibc security updates on trusty
[12:02:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:20] <wikibugs__>	 (PS1) Alexandros Kosiaris: Remove system::role {'config-master': } [puppet] - https://gerrit.wikimedia.org/r/345536
[12:03:22] <wikibugs>	 (PS1) Alexandros Kosiaris: Remove system::role { 'conftool-master': } [puppet] - https://gerrit.wikimedia.org/r/345537
[12:03:24] <wikibugs__>	 (PS1) Alexandros Kosiaris: Move and rename system::role{ 'role::docker::builder':} [puppet] - https://gerrit.wikimedia.org/r/345538
[12:11:03] <icinga-wm>	 PROBLEM - puppet last run on mw1214 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:13:24] <wikibugs>	 (CR) Hashar: [C: 1] [cleanup] Remove expired rules [mediawiki-config] - https://gerrit.wikimedia.org/r/344991 (https://phabricator.wikimedia.org/T161530) (owner: Urbanecm)
[12:13:52] <wikibugs__>	 (CR) Hashar: [C: 1] Allow eliminators and autoreviewers to move a file on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/344989 (https://phabricator.wikimedia.org/T161532) (owner: Urbanecm)
[12:14:44] <wikibugs>	 (CR) Hashar: [C: 1] Assign move-categorypages to sysops&bots only on nlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/345093 (https://phabricator.wikimedia.org/T161551) (owner: Urbanecm)
[12:15:48] <wikibugs>	 (CR) Hashar: [C: 1] Enable Multimedia Viewer at officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/342798 (https://phabricator.wikimedia.org/T160420) (owner: Urbanecm)
[12:15:58] <hoo>	 !log Updated the Constraints table on Wikidata, per T160506.
[12:16:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:05] <stashbot>	 T160506: Update constraint table - https://phabricator.wikimedia.org/T160506
[12:17:04] <wikibugs__>	 (CR) Giuseppe Lavagetto: [C: 1] "I explicitly verified name resolution is correct for all entries and is  equivalent to our preceding configuration." [mediawiki-config] - https://gerrit.wikimedia.org/r/345508 (owner: Giuseppe Lavagetto)
[12:18:32] <logmsgbot>	 !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=eventbus,name=codfw
[12:18:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:16] <wikibugs>	 (CR) Elukey: [C: 1] "LGTM, ran pcc also with some mcXXXX hosts https://puppet-compiler.wmflabs.org/5972 even if not necessary." [puppet] - https://gerrit.wikimedia.org/r/345529 (owner: Giuseppe Lavagetto)
[12:23:00] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] "nice, a few minor comments inline" (12 comments) [puppet] - https://gerrit.wikimedia.org/r/345310 (https://phabricator.wikimedia.org/T157613) (owner: Gehel)
[12:23:26] <gehel>	 akosiaris: ^ "a few" = 12 ... :P
[12:24:41] <elukey>	 reminds me volans :P
[12:24:46] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 1] "I've personally looked at all hostname changes and verified everything should stay the same." [mediawiki-config] - https://gerrit.wikimedia.org/r/345509 (owner: Giuseppe Lavagetto)
[12:25:26] <volans>	 lol
[12:26:07] <addshore>	 moritzm! Is it worth me pointing out again that grafana uses a public API and that means all of the data is public again? I find the pii reasoning odd (as did i when initially trying to get access to graphite some years ago)
[12:28:03] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 1] "Uhm ok, I was actually debated about where to put system::role definitions, as some hosts will have one role that is clearly a composition" [puppet] - https://gerrit.wikimedia.org/r/345538 (owner: Alexandros Kosiaris)
[12:28:59] <wikibugs__>	 (CR) Giuseppe Lavagetto: [C: 1] Remove system::role { 'conftool-master': } [puppet] - https://gerrit.wikimedia.org/r/345537 (owner: Alexandros Kosiaris)
[12:30:04] <akosiaris>	 gehel: well... most are a ditto :P
[12:30:15] <gehel>	 yeah, I saw...
[12:33:14] <wikibugs>	 (CR) Alexandros Kosiaris: "Done in https://wikitech.wikimedia.org/w/index.php?title=Puppet_coding&diff=1754530&oldid=1352155" [puppet] - https://gerrit.wikimedia.org/r/345538 (owner: Alexandros Kosiaris)
[12:34:47] <bblack>	 hah
[12:35:00] <bblack>	 maybe system::profile makes more sense?
[12:35:17] <bblack>	 I get confused a lot because the english words profile and role seem backwards from how puppet actually uses them, but whatever
[12:35:49] <bblack>	 but if a profile is a singular function of a host, and a role gathers up a set of profiles that are deployed together on a type of multi-function host
[12:36:04] <bblack>	 and motd was describing the list of functions.  it seems like it should suck up multiple system::profile for motd
[12:39:03] <icinga-wm>	 RECOVERY - puppet last run on mw1214 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[12:48:15] <wikibugs__>	 (PS5) Gehel: postgresql - simplify creation of databases [puppet] - https://gerrit.wikimedia.org/r/345310 (https://phabricator.wikimedia.org/T157613)
[12:49:25] <moritzm>	 !log rebooting bast4001 for kernel update to 4.9
[12:49:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:14] <wikibugs__>	 (CR) Gehel: postgresql - simplify creation of databases (12 comments) [puppet] - https://gerrit.wikimedia.org/r/345310 (https://phabricator.wikimedia.org/T157613) (owner: Gehel)
[12:50:40] <zeljkof>	 jouncebot: next
[12:50:41] <jouncebot>	 In 0 hour(s) and 9 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170330T1300)
[12:50:59] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: 1] "two minor comments but LGTM; we will be left with some cleanup to do afterwards, but that's not your current concern." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/333880 (owner: Elukey)
[12:54:20] <_joe_>	 elukey: I'd merge that patch and the eventual followup after the EU SWAT
[12:55:32] <wikibugs__>	 (PS2) Hashar: [cleanup] Remove expired rules [mediawiki-config] - https://gerrit.wikimedia.org/r/344991 (https://phabricator.wikimedia.org/T161530) (owner: Urbanecm)
[12:55:34] <wikibugs>	 (PS2) Hashar: Allow eliminators and autoreviewers to move a file on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/344989 (https://phabricator.wikimedia.org/T161532) (owner: Urbanecm)
[12:55:36] <wikibugs__>	 (PS2) Hashar: Assign move-categorypages to sysops&bots only on nlwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/345093 (https://phabricator.wikimedia.org/T161551) (owner: Urbanecm)
[12:55:38] <wikibugs>	 (PS2) Hashar: Enable Multimedia Viewer at officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/342798 (https://phabricator.wikimedia.org/T160420) (owner: Urbanecm)
[12:55:39] <hashar>	 zeljkof: I have rebased urbanecm patches
[12:56:06] <Urbanecm>	 hashar, zeljkof: I'm ready for SWATing so if nothing has to be done before SWAT we can start I think
[12:56:23] <zeljkof>	 o/
[12:56:35] <elukey>	 _joe_ do you mean mine or yours ? :)
[12:56:43] <_joe_>	 yours
[12:56:48] <zeljkof>	 hashar: want to do the swat? should I?
[12:56:51] <_joe_>	 and then possibly mine ones
[12:56:53] <elukey>	 ah okok! Updating the commit msg
[12:56:54] <hashar>	 zeljkof: can you ?
[12:56:58] <zeljkof>	 hashar: sure
[12:57:00] <hashar>	 got a meeting in half an hour
[12:57:03] <icinga-wm>	 PROBLEM - puppet last run on mw1220 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:57:08] <elukey>	 _joe_ I need to review the last two
[12:57:11] <hashar>	 but I can assist until that meeting start :]
[12:57:18] * zeljkof is getting ready for swat
[12:57:27] <elukey>	 _joe_ but you can proceed if you want, you know my puppet level :)
[12:57:33] <hashar>	 I think we can deploy all four patches straight to mwdebug1001 and test there
[12:57:40] <zeljkof>	 hashar: it would be great if you could take a look at the patches and +1 them if you think they are fine
[12:57:49] <hashar>	 one is a throttle cleanup, the three others are each for separate wikis
[12:57:51] <_joe_>	 elukey: I'm involved in SWAT, I have two patches to publish
[12:57:52] <hashar>	 I did
[12:58:01] <zeljkof>	 hashar: great, thanks
[12:58:11] <zeljkof>	 _joe_: you are deploying your own patches?
[12:58:42] <hashar>	 then there are a couple patches to change some hostnames by _joe_ and elukey
[12:59:39] <_joe_>	 zeljkof: well I assumed I was going through the process
[12:59:52] <_joe_>	 but I can merge and deploy them myself
[13:00:04] <jouncebot>	 addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170330T1300).
[13:00:04] <jouncebot>	 Urbanecm and _joe_: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[13:00:16] <zeljkof>	 _joe_: you can deploy yourself, either before or after the rest of the patches
[13:00:24] <wikibugs>	 (PS14) Elukey: role::memcached: refactor in multiple profiles [puppet] - https://gerrit.wikimedia.org/r/333880
[13:00:27] <zeljkof>	 _joe_: you can deploy all patches, if you want to
[13:00:50] <zeljkof>	 but I can deploy the rest of the patches, if you prefer that
[13:00:53] <_joe_>	 zeljkof: no thanks, I have other things to do too, go on with the rest of the patches, I'll be here after
[13:01:32] <zeljkof>	 _joe_: ok, starting with the swat, will ping you when I am done, so you can take over
[13:01:38] <zeljkof>	 I can SWAT today!
[13:03:47] <moritzm>	 addshore: grafana supports more data sources (such as prometheus) and we can't rule out that this leaks PII data
[13:03:54] <wikibugs>	 (PS1) BBlack: cxserver: active/active public interface [puppet] - https://gerrit.wikimedia.org/r/345542
[13:03:56] <wikibugs__>	 (PS1) BBlack: citoid: active/active public interface [puppet] - https://gerrit.wikimedia.org/r/345543
[13:04:32] <wikibugs__>	 (PS2) Gehel: Cirrus / Analytics - remove deprecated rsync job [puppet] - https://gerrit.wikimedia.org/r/345362
[13:06:15] <zeljkof>	 Urbanecm: merging and deploying 344991
[13:06:19] <Urbanecm>	 zeljkof, ack
[13:06:51] <wikibugs>	 (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/344991 (https://phabricator.wikimedia.org/T161530) (owner: Urbanecm)
[13:10:07] <Urbanecm>	 zeljkof, jenkins don't work?
[13:10:19] <zeljkof>	 Urbanecm: looks like jenkins is busy :(
[13:10:28] <zeljkof>	 one of the jobs is still in queue
[13:10:49] <Urbanecm>	 zeljkof, maybe restart of the job would help but I'm not sure. 
[13:10:50] <wikibugs__>	 Operations, ORES, Revision-Scoring-As-A-Service-Backlog, Services (designing), User-mobrovac: [spec] Active-active setup for ORES across datacenters (eqiad, codfw) - https://phabricator.wikimedia.org/T159615#3143490 (akosiaris) Sorry for not answering sooner on this.  @mobrovac That's an arch...
[13:10:52] <wikibugs>	 (CR) Jcrespo: [C: 1] "Not sure if you want it to do it separately. But remember it has to be separated from the hostname -> ip selection too here and on the cod" [mediawiki-config] - https://gerrit.wikimedia.org/r/345535 (https://phabricator.wikimedia.org/T160435) (owner: Marostegui)
[13:11:35] <wikibugs__>	 (CR) Marostegui: "> Not sure if you want it to do it separately. But remember it has to" [mediawiki-config] - https://gerrit.wikimedia.org/r/345535 (https://phabricator.wikimedia.org/T160435) (owner: Marostegui)
[13:12:06] <wikibugs__>	 (CR) Alexandros Kosiaris: [C: 1] "looks fine to me now (famous last words ;-))" [puppet] - https://gerrit.wikimedia.org/r/345310 (https://phabricator.wikimedia.org/T157613) (owner: Gehel)
[13:12:27] <zeljkof>	 hashar: the first patch, and jenkins is stuck :(
[13:13:04] <zeljkof>	 operations-mw-config-composer-hhvm-jessie for 344991,2 has been in the queue for 5 minutes
[13:13:36] <wikibugs__>	 (PS2) Giuseppe Lavagetto: role::mediawiki::common: extract scap::proxy to a profile [puppet] - https://gerrit.wikimedia.org/r/345527
[13:15:03] <zeljkof>	 hashar: the only thing I see are a couple of core and wikibase jobs consuming a lot of instances
[13:15:26] <zeljkof>	 not sure what to do or how to speed things up
[13:15:49] <wikibugs>	 (CR) Faidon Liambotis: [C: -1] "Also while you're touching this, fix sysctl.pp too:" [puppet] - https://gerrit.wikimedia.org/r/345366 (owner: Dzahn)
[13:16:40] <wikibugs__>	 (CR) Gehel: [C: 2] Cirrus / Analytics - remove deprecated rsync job [puppet] - https://gerrit.wikimedia.org/r/345362 (owner: Gehel)
[13:16:50] <wikibugs>	 (PS1) Marostegui: site.pp: Remove db1057 [puppet] - https://gerrit.wikimedia.org/r/345545 (https://phabricator.wikimedia.org/T160435)
[13:16:51] <zeljkof>	 Urbanecm, hashar the job is finally running
[13:16:52] <wikibugs__>	 (CR) Giuseppe Lavagetto: [V: 2 C: 2] role::mediawiki::common: extract scap::proxy to a profile [puppet] - https://gerrit.wikimedia.org/r/345527 (owner: Giuseppe Lavagetto)
[13:16:56] <elukey>	 hashar: as FYI https://gerrit.wikimedia.org/r/#/c/333880/ - should be a noop
[13:16:57] <Urbanecm>	 zeljkof, how do you watch the jenkins status?
[13:16:59] <wikibugs__>	 (Merged) jenkins-bot: [cleanup] Remove expired rules [mediawiki-config] - https://gerrit.wikimedia.org/r/344991 (https://phabricator.wikimedia.org/T161530) (owner: Urbanecm)
[13:17:07] <zeljkof>	 Urbanecm: https://integration.wikimedia.org/zuul/
[13:17:08] <wikibugs__>	 (PS3) Giuseppe Lavagetto: role::mediawiki::common: extract scap::proxy to a profile [puppet] - https://gerrit.wikimedia.org/r/345527
[13:17:10] <wikibugs>	 (CR) jenkins-bot: [cleanup] Remove expired rules [mediawiki-config] - https://gerrit.wikimedia.org/r/344991 (https://phabricator.wikimedia.org/T161530) (owner: Urbanecm)
[13:17:11] <Urbanecm>	 zeljkof, thank you
[13:17:25] <hashar>	 zeljkof: looks good now?
[13:17:28] <zeljkof>	 344991 merged, deploying
[13:17:46] <zeljkof>	 hashar: well, the commit got merged, only 10 minutes or so :/
[13:18:13] <hashar>	 yeah the Wikibase patches are consuming too many instances
[13:18:16] <hashar>	 that is being refined
[13:18:41] <hashar>	 zeljkof: have you CR+2 the other ones?
[13:19:07] <zeljkof>	 hashar: not yet, was not sure if that will make things worse :/
[13:19:17] <hashar>	 they get enqueued
[13:19:28] <hashar>	 and processed as instances are made available
[13:19:38] <logmsgbot>	 !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:344991|[cleanup] Remove expired rules (T161530)]] (duration: 00m 45s)
[13:19:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:44] <stashbot>	 T161530: Remove throttle for ptwiki event - https://phabricator.wikimedia.org/T161530
[13:19:57] <zeljkof>	 Urbanecm: 344991 deployed, nothing to check, right?
[13:20:29] <zeljkof>	 Urbanecm: the rest of the patches can be tested at mwdebug1002? (once deployed there)
[13:20:31] <Urbanecm>	 zeljkof, yep. Everything should be caught by filters.
[13:20:57] <Urbanecm>	 zeljkof, except 342798 as I have no access to officewiki. 
[13:21:11] <wikibugs>	 (03PS6) 10Gehel: postgresql - simplify creation of databases [puppet] - 10https://gerrit.wikimedia.org/r/345310 (https://phabricator.wikimedia.org/T157613)
[13:21:13] <wikibugs__>	 (03CR) 10Elukey: [C: 031] "https://puppet-compiler.wmflabs.org/5973/ - seems a noop (changes are related to previous patches from what I can tell)" [puppet] - 10https://gerrit.wikimedia.org/r/345530 (owner: 10Giuseppe Lavagetto)
[13:21:13] <icinga-wm>	 PROBLEM - puppet last run on mw1295 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:21:17] <zeljkof>	 Urbanecm: ok, will ping you as commits are at mwdebug1002
[13:21:21] <wikibugs>	 (03PS1) 10Faidon Liambotis: mediawiki: kill more precise references [puppet] - 10https://gerrit.wikimedia.org/r/345546
[13:21:23] <Urbanecm>	 zeljkof, okay
[13:21:23] <wikibugs__>	 (03PS1) 10Faidon Liambotis: hhvm: kill a precise reference [puppet] - 10https://gerrit.wikimedia.org/r/345547
[13:21:25] <wikibugs>	 (03PS1) 10Faidon Liambotis: apache: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345548
[13:21:27] <wikibugs__>	 (03PS1) 10Faidon Liambotis: installserver: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345549
[13:21:29] <wikibugs>	 (03PS1) 10Faidon Liambotis: aptrepo: remove precise-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/345550
[13:21:31] <wikibugs__>	 (03PS1) 10Faidon Liambotis: ganglia: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345551
[13:21:33] <wikibugs>	 (03PS1) 10Faidon Liambotis: elasticsearch: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345552
[13:21:35] <wikibugs__>	 (03PS1) 10Faidon Liambotis: noc: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345553
[13:21:37] <wikibugs>	 (03PS1) 10Faidon Liambotis: puppetmaster: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345554
[13:21:39] <wikibugs__>	 (03PS1) 10Faidon Liambotis: memcached: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345555
[13:21:41] <wikibugs>	 (03PS1) 10Faidon Liambotis: openstack/nova: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345556
[13:21:43] <wikibugs__>	 (03PS1) 10Faidon Liambotis: ci: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345557
[13:21:45] <wikibugs>	 (03PS1) 10Faidon Liambotis: lxc: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345558
[13:21:47] <wikibugs__>	 (03PS1) 10Faidon Liambotis: toollabs: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345559
[13:21:49] <wikibugs>	 (03PS1) 10Ottomata: Attempt to run apt-get update before proceeding with installing cdh packages [puppet] - 10https://gerrit.wikimedia.org/r/345560
[13:21:54] <wikibugs__>	 (03PS2) 10Marostegui: site.pp,linux-host-entries.ttyS1: Remove db1057 [puppet] - 10https://gerrit.wikimedia.org/r/345545 (https://phabricator.wikimedia.org/T160435)
[13:21:56] <Urbanecm>	 Seems we'll have some jenkins problems, this will create a lot of jobs...
[13:22:01] <wikibugs>	 (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344989 (https://phabricator.wikimedia.org/T161532) (owner: 10Urbanecm)
[13:22:40] <zeljkof>	 and ops/puppet has priority... :/
[13:22:59] <Urbanecm>	 :(
[13:23:10] <wikibugs>	 (03CR) 10Elukey: "This one will probably need a broader audience. It looks good but we create a tight dependency with the discovery service in puppet, that " [puppet] - 10https://gerrit.wikimedia.org/r/345531 (owner: 10Giuseppe Lavagetto)
[13:23:22] <wikibugs__>	 (03PS2) 10Ottomata: Attempt to run apt-get update before proceeding with installing cdh packages [puppet] - 10https://gerrit.wikimedia.org/r/345560
[13:23:27] <wikibugs__>	 (03CR) 10Ottomata: [V: 032 C: 032] Attempt to run apt-get update before proceeding with installing cdh packages [puppet] - 10https://gerrit.wikimedia.org/r/345560 (owner: 10Ottomata)
[13:23:33] <zeljkof>	 hashar: ok, now a lot of ops/puppet commits have landed... :(
[13:23:34] <wikibugs>	 (03CR) 10Muehlenhoff: "That's a duplicate of https://gerrit.wikimedia.org/r/#/c/345502/, but feel free to merge yours." [puppet] - 10https://gerrit.wikimedia.org/r/345546 (owner: 10Faidon Liambotis)
[13:25:03] <icinga-wm>	 RECOVERY - puppet last run on mw1220 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[13:25:06] <zeljkof>	 Urbanecm: I have no idea when the second commit will get merged...
[13:25:16] <zeljkof>	 I will ping you as soon as something happens...
[13:25:20] <zeljkof>	 but it might be a while
[13:25:29] <hashar>	 I have recommended to CR+2 all four patches
[13:25:34] <hashar>	 so you get CI processing them in the background
[13:25:36] <Urbanecm>	 Okay, I have time :)
[13:25:45] <hashar>	 and by the time your first commit is deployed, the others will have been merged
[13:25:48] <Urbanecm>	 and won't have to wait on it. 
[13:25:48] <zeljkof>	 hashar: oh well, looks like I should have done that...
[13:25:53] <hashar>	 most probably you can just push
[13:26:27] <zeljkof>	 hashar: I will +2 them now, looks like there is nothing else to do
[13:27:19] <wikibugs>	 (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342798 (https://phabricator.wikimedia.org/T160420) (owner: 10Urbanecm)
[13:27:34] <hashar>	 then you can fetch them on tin as needed and rebase one by one
[13:27:38] <_joe_>	 hashar, zeljkof I would like to avoid auto-merging my patches, btw, so if you want to +2 them (if they don't seem wrong)
[13:27:47] <hashar>	 or just deploy all those four trivial patches to mwdebug to test them all
[13:28:24] <wikibugs__>	 (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345093 (https://phabricator.wikimedia.org/T161551) (owner: 10Urbanecm)
[13:28:38] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: role::mediawiki::common: convert to profile (step 1) [puppet] - 10https://gerrit.wikimedia.org/r/345528
[13:28:42] <wikibugs__>	 (03Merged) 10jenkins-bot: Allow eliminators and autoreviewers to move a file on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344989 (https://phabricator.wikimedia.org/T161532) (owner: 10Urbanecm)
[13:28:47] <hashar>	 _joe_: I am in a meeting though
[13:29:05] <hashar>	 _joe_: but looks like Elukey reviewed them and with the mwdebug deploy it should be fine
[13:29:30] <_joe_>	 hashar: elukey just reviewed my puppet patches, not the ones up to deploy, but don't worry
[13:29:33] <zeljkof>	 _joe_: I can take a look, but I am not familiar with that code
[13:29:45] <wikibugs>	 (03CR) 10jenkins-bot: Allow eliminators and autoreviewers to move a file on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344989 (https://phabricator.wikimedia.org/T161532) (owner: 10Urbanecm)
[13:29:50] <_joe_>	 zeljkof: kk, I'll auto-merge then :P
[13:30:19] <zeljkof>	 _joe_: that would actually make more sense, you know more about that than me :)
[13:31:15] <wikibugs>	 (03Merged) 10jenkins-bot: Assign move-categorypages to sysops&bots only on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345093 (https://phabricator.wikimedia.org/T161551) (owner: 10Urbanecm)
[13:31:31] <wikibugs>	 (03CR) 10jenkins-bot: Assign move-categorypages to sysops&bots only on nlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345093 (https://phabricator.wikimedia.org/T161551) (owner: 10Urbanecm)
[13:31:54] <icinga-wm>	 PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:31:54] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Multimedia Viewer at officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342798 (https://phabricator.wikimedia.org/T160420) (owner: 10Urbanecm)
[13:32:56] <wikibugs__>	 (03CR) 10Gehel: [C: 031] "LGTM - thanks for the cleanup!" [puppet] - 10https://gerrit.wikimedia.org/r/345552 (owner: 10Faidon Liambotis)
[13:33:12] <wikibugs>	 (03CR) 10jenkins-bot: Enable Multimedia Viewer at officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342798 (https://phabricator.wikimedia.org/T160420) (owner: 10Urbanecm)
[13:34:43] <zeljkof>	 hashar, Urbanecm: looks like ops/puppet patches were quick, all swat patches are merged, deploying to mwdebug1002
[13:36:02] <wikibugs>	 (03PS1) 10Faidon Liambotis: Standardize on lowercase os_version/require_os [puppet] - 10https://gerrit.wikimedia.org/r/345561
[13:36:19] <Urbanecm>	 zeljkof, ack
[13:36:37] <wikibugs__>	 (03CR) 10Muehlenhoff: "Why is that needed? "apt-get update" is run with every puppet run anyway?" [puppet] - 10https://gerrit.wikimedia.org/r/345560 (owner: 10Ottomata)
[13:37:24] <zeljkof>	 Urbanecm: ok, all patches at mwdebug1002
[13:37:38] <zeljkof>	 I will try to check officewiki one (342798)
[13:37:50] * Urbanecm is going to test them
[13:38:00] <zeljkof>	 let me know when you have tested the other two
[13:38:15] <godog>	 volans: yup new backend hosts and expired downtime :(
[13:38:31] <volans>	 I figured, no worries
[13:39:29] <wikibugs__>	 (03PS1) 10Ottomata: Add passwords::mysql::analytics_labsdb class to match prod [labs/private] - 10https://gerrit.wikimedia.org/r/345562
[13:39:46] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 032] role::mediawiki::common: convert to profile (step 1) [puppet] - 10https://gerrit.wikimedia.org/r/345528 (owner: 10Giuseppe Lavagetto)
[13:39:52] <wikibugs__>	 (03CR) 10Ottomata: [V: 032 C: 032] Add passwords::mysql::analytics_labsdb class to match prod [labs/private] - 10https://gerrit.wikimedia.org/r/345562 (owner: 10Ottomata)
[13:40:01] <_joe_>	 ahah I E
[13:40:07] <_joe_>	 BEATED YOU OTTO
[13:40:31] <elukey>	 lol
[13:40:51] <Urbanecm>	 zeljkof, working, please deploy them to the whole cluster
[13:40:54] <wikibugs__>	 (03PS1) 10Faidon Liambotis: vagrant: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345563
[13:40:56] <wikibugs>	 (03PS1) 10Faidon Liambotis: interface: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345564
[13:41:29] <zeljkof>	 Urbanecm: will do, just a minute to check officewiki
[13:42:34] <zeljkof>	 multimediaviewer works on office wiki, deploying all patches
[13:43:36] <Urbanecm>	 zeljkof, thank you
[13:46:54] <logmsgbot>	 !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:344989|Allow eliminators and autoreviewers to move a file on ptwiki (T161532)]] [[gerrit:345093|Assign move-categorypages to sysops&bots only on nlwiki (T161551)]] [[gerrit:342798|Enable Multimedia Viewer at officewiki (T160420)]] (duration: 00m 44s)
[13:47:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:47:02] <stashbot>	 T160420: Install multimediaviewer on office wiki - https://phabricator.wikimedia.org/T160420
[13:47:02] <stashbot>	 T161551: Remove "move-categorypages" for normal users on nlwiki - https://phabricator.wikimedia.org/T161551
[13:47:02] <stashbot>	 T161532: Add permission "movefile"  to two user groups (ptwiki) - https://phabricator.wikimedia.org/T161532
[13:47:03] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 647036
[13:47:15] <zeljkof>	 Urbanecm: everything deployed, please check production, I will check officewiki
[13:47:22] <zeljkof>	 _joe_: swat is all yours
[13:47:24] <Urbanecm>	 zeljkof, okay
[13:47:26] <_joe_>	 zeljkof: thanks
[13:47:43] <zeljkof>	 _joe_: apologies for the delay, jenkins was busy
[13:47:51] <_joe_>	 yeah not your fault
[13:47:59] <Urbanecm>	 jouncebot, now
[13:47:59] <jouncebot>	 For the next 0 hour(s) and 12 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170330T1300)
[13:48:16] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Convert parsoid, mathoid, restbase urls to use the discovery hostname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345508
[13:49:11] <wikibugs__>	 (03CR) 10Giuseppe Lavagetto: [C: 032] Convert parsoid, mathoid, restbase urls to use the discovery hostname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345508 (owner: 10Giuseppe Lavagetto)
[13:49:27] <Urbanecm>	 zeljkof, working, thank you!
[13:49:49] <zeljkof>	 Urbanecm: I can not get multimedia viewer to work on office wiki :/
[13:50:17] <Urbanecm>	 zeljkof, did it work at mwdebug1002?
[13:50:33] <zeljkof>	 it was working fine on mwdebug1002
[13:50:41] <Urbanecm>	 But why?
[13:50:58] <zeljkof>	 that's what is confusing :(
[13:51:13] <icinga-wm>	 RECOVERY - puppet last run on mw1295 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[13:51:17] <kaldari>	 you guys still Swatting?
[13:51:25] <Urbanecm>	 yes
[13:51:27] <zeljkof>	 kaldari: _joe_ is
[13:51:46] <_joe_>	 kaldari: yeah I'm merging my own patches, I'm not really a deployer and I have a ton of things to do after these patches
[13:51:49] <kaldari>	 _joe_: Is it too late to add a config change to the swat?
[13:51:57] <wikibugs__>	 (03Merged) 10jenkins-bot: Convert parsoid, mathoid, restbase urls to use the discovery hostname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345508 (owner: 10Giuseppe Lavagetto)
[13:51:58] <_joe_>	 kaldari: ask zeljkof :)
[13:52:07] <wikibugs>	 (03CR) 10jenkins-bot: Convert parsoid, mathoid, restbase urls to use the discovery hostname [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345508 (owner: 10Giuseppe Lavagetto)
[13:52:08] <Urbanecm>	 zeljkof, is https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php cached or does it reflect current config?
[13:52:08] <zeljkof>	 Urbanecm: when I enabled the extension, selected mwdebug1002, clicked on a thumbnail, MMV opened
[13:52:19] <Urbanecm>	 zeljkof, which is what we want. 
[13:52:38] <zeljkof>	 kaldari: I am around. want to deploy it yourself, or should I?
[13:53:15] <zeljkof>	 Urbanecm: but now without the extension enabled, officewiki just opens an image in a separate page, not in MMV
[13:53:19] <kaldari>	 zeljkof: I can do it myself if you want to take off
[13:53:26] <wikibugs__>	 (03CR) 10Elukey: "Again really sorry for the delay! If you still have patience I left some comments for this code review, thanks for the time!" (032 comments) [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/318053 (owner: 10R4q3NWnUx2CEhVyr)
[13:53:46] <zeljkof>	 kaldari: if _joe_ is done with his part of deploy, feel free to do yours :)
[13:54:06] <zeljkof>	 kaldari: I am around, if you prefer that I do the deploy
[13:54:22] <Urbanecm>	 zeljkof, is it possible to check if it is really enabled at prod? This is the only thing that came to my mind.
[13:54:31] <hashar>	 back around
[13:54:37] <zeljkof>	 Urbanecm: not sure
[13:54:41] <hashar>	 anything else left to do still?
[13:54:51] <zeljkof>	 hashar: can you check if multimedia viewer is enabled at office wiki? 
[13:54:55] <Urbanecm>	 hashar, we don't know why multimedia viewer works with mwdebug1002 and not at prod.
[13:54:56] <kaldari>	 _joe_: Just let me know when you're done with yours
[13:54:58] <hashar>	 for multimedia viewer, if it does not work on office wiki it is not a problem
[13:55:03] <hashar>	 we can just report it is broken on the task
[13:55:10] <hashar>	 tgr said we can debug it / fix it later
[13:55:13] <zeljkof>	 just go to home page and click an image, it should open in a MMV popup
[13:55:18] <Urbanecm>	 hashar, okay, I'll write it there.
[13:55:35] <zeljkof>	 hashar: strange thing, it worked on mwdebug1002 :/
[13:55:49] <hashar>	 are you sure you did a scap pull on both ?
[13:56:07] <zeljkof>	 hashar: nevermind, works now
[13:56:13] <zeljkof>	 maybe it needed a few minutes
[13:56:16] <Urbanecm>	 zeljkof, hashar: So is it working or not?
[13:56:24] <zeljkof>	 Urbanecm: MMV works on office wiki
[13:56:24] <Urbanecm>	 Should I write anything special on the task?
[13:56:30] <hashar>	 works for me on mwdebug1002
[13:56:35] <Urbanecm>	 zeljkof, okay, thank you for deploys and checking!
[13:56:58] <wikibugs>	 (03PS1) 10Faidon Liambotis: Switch add_ip6_mapped to use $::interface_primary [puppet] - 10https://gerrit.wikimedia.org/r/345568
[13:57:09] <hashar>	 Urbanecm: thanks for the patch :]
[13:57:10] <wikibugs__>	 (03PS7) 10Gehel: postgresql - simplify creation of databases [puppet] - 10https://gerrit.wikimedia.org/r/345310 (https://phabricator.wikimedia.org/T157613)
[13:57:15] <zeljkof>	 Urbanecm: no problem :) thanks for flying with #releng
[13:57:17] <Urbanecm>	 You're welcome. 
[13:57:23] <hashar>	 zeljkof: so yeah I guess you can sync the multimedia viewer change to the cluster
[13:57:31] <zeljkof>	 hashar: it already is
[13:57:34] <_joe_>	 kaldari: I'm merging and testing at the same time, so it will take me some additional time tbh
[13:57:35] <hashar>	 \o/
[13:57:52] <zeljkof>	 it worked fine on mwdebug, pushed to cluster, but could not verify for a few minutes
[13:57:57] <zeljkof>	 works now
[13:58:02] <kaldari>	 _joe_: no worries, take your time
[13:58:13] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 636695
[13:58:18] <hashar>	 zeljkof: maybe because of resourceloader cache. I don't quite know how it gets invalidated
[13:58:31] <zeljkof>	 hashar: maybe
[14:00:53] <icinga-wm>	 RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[14:01:09] <logmsgbot>	 !log oblivian@tin Synchronized wmf-config/ProductionServices.php: switch to discovery for some records (duration: 00m 47s)
[14:01:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:10] <wikibugs>	 (03PS1) 10Faidon Liambotis: Replace $main_ipaddress by $::interface_primary [puppet] - 10https://gerrit.wikimedia.org/r/345569
[14:03:05] <wikibugs__>	 (03PS2) 10Giuseppe Lavagetto: Convert cxserver, eventbus to use the discovery url [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345509
[14:03:13] <wikibugs__>	 (03CR) 10Giuseppe Lavagetto: [C: 032] Convert cxserver, eventbus to use the discovery url [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345509 (owner: 10Giuseppe Lavagetto)
[14:03:23] <icinga-wm>	 PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 706171
[14:04:30] <wikibugs__>	 (03Merged) 10jenkins-bot: Convert cxserver, eventbus to use the discovery url [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345509 (owner: 10Giuseppe Lavagetto)
[14:04:43] <wikibugs>	 (03CR) 10jenkins-bot: Convert cxserver, eventbus to use the discovery url [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345509 (owner: 10Giuseppe Lavagetto)
[14:04:54] <wikibugs>	 (03CR) 10Gehel: [C: 032] postgresql - simplify creation of databases [puppet] - 10https://gerrit.wikimedia.org/r/345310 (https://phabricator.wikimedia.org/T157613) (owner: 10Gehel)
[14:06:09] <_joe_>	 kaldari: I'm almost done, need to watch logs for a few to be sure nothing is exploding tho
[14:06:24] <logmsgbot>	 !log oblivian@tin Synchronized wmf-config/ProductionServices.php: switch to discovery for cxserver,eventbus (duration: 00m 43s)
[14:06:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:23] <icinga-wm>	 PROBLEM - puppet last run on elastic1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:08:57] <wikibugs__>	 (03PS1) 10Gehel: Revert "postgresql - simplify creation of databases" [puppet] - 10https://gerrit.wikimedia.org/r/345572
[14:08:58] <_joe_>	 kaldari, zeljkof I'm done
[14:09:07] <wikibugs__>	 (03CR) 10Gehel: [V: 032 C: 032] Revert "postgresql - simplify creation of databases" [puppet] - 10https://gerrit.wikimedia.org/r/345572 (owner: 10Gehel)
[14:09:23] <icinga-wm>	 RECOVERY - puppet last run on ms-be1032 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[14:09:53] <icinga-wm>	 RECOVERY - puppet last run on ms-be1033 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[14:10:29] <zeljkof>	 kaldari: if no other deployment is happening, take over swat
[14:10:31] <kaldari>	 _joe_: Thanks, you have anything else zeljkof or should I go now
[14:10:32] <kaldari>	 ?
[14:10:35] <kaldari>	 Cool
[14:10:44] <zeljkof>	 kaldari: I am done
[14:11:57] <wikibugs__>	 (03PS1) 10Gehel: postgresql - simplify creation of databases [puppet] - 10https://gerrit.wikimedia.org/r/345573 (https://phabricator.wikimedia.org/T157613)
[14:13:00] <wikibugs>	 (03CR) 10Kaldari: [C: 032] Enable cookie blocking on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324672 (https://phabricator.wikimedia.org/T152076) (owner: 10Kaldari)
[14:13:10] <wikibugs__>	 (03PS3) 10Kaldari: Enable cookie blocking on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324672 (https://phabricator.wikimedia.org/T152076)
[14:15:47] <wikibugs>	 (03CR) 10R4q3NWnUx2CEhVyr: "The code flow is as follows:" [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/318053 (owner: 10R4q3NWnUx2CEhVyr)
[14:18:43] <wikibugs>	 (03CR) 10jenkins-bot: Enable cookie blocking on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/324672 (https://phabricator.wikimedia.org/T152076) (owner: 10Kaldari)
[14:20:59] <marostegui>	 is swat done?
[14:21:06] <kaldari>	 marostegui: almost
[14:21:09] <marostegui>	 oki!
[14:21:48] <kaldari>	 !log sync InitialiseSettings.php to enable cookie blocking on English Wikipedia
[14:21:53] <wikibugs>	 (03PS2) 10Gehel: postgresql - simplify creation of databases [puppet] - 10https://gerrit.wikimedia.org/r/345573 (https://phabricator.wikimedia.org/T157613)
[14:21:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:13] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1090" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345574
[14:22:16] <wikibugs__>	 (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1090" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345574
[14:22:44] <logmsgbot>	 !log kaldari@tin Synchronized wmf-config/InitialiseSettings.php: (no justification provided) (duration: 00m 48s)
[14:22:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:28] <kaldari>	 marostegui: SWAT is done!
[14:27:38] <marostegui>	 \o/
[14:27:39] <marostegui>	 thanks!
[14:28:02] <wikibugs__>	 (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1090" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345574 (owner: 10Marostegui)
[14:28:04] <wikibugs>	 (03PS1) 10Ottomata: Remove unused statistics::rsync::webrequest class from statistics::private [puppet] - 10https://gerrit.wikimedia.org/r/345578
[14:29:11] <wikibugs__>	 (03CR) 10Ottomata: [V: 032 C: 032] Remove unused statistics::rsync::webrequest class from statistics::private [puppet] - 10https://gerrit.wikimedia.org/r/345578 (owner: 10Ottomata)
[14:29:17] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1090" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345574 (owner: 10Marostegui)
[14:29:29] <wikibugs__>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1090" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345574 (owner: 10Marostegui)
[14:30:11] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1090 - T17441 (duration: 00m 45s)
[14:30:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:18] <stashbot>	 T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441
[14:30:31] <wikibugs>	 (03PS2) 10Marostegui: db-eqiad.php: Remove db1057 from s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345535 (https://phabricator.wikimedia.org/T160435)
[14:31:33] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: x1 on db2033 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 446.87 seconds
[14:32:30] <marostegui>	 ^ checking
[14:33:11] <moritzm>	 !log rebooting restbase2001 to Linux 4.9
[14:33:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:48] <andrewbogott>	 !log upgrading nova-compute to 12.0.6 on all labvirts
[14:33:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:51] <icinga-wm>	 PROBLEM - swift-object-replicator on ms-be1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[14:36:10] <icinga-wm>	 PROBLEM - swift-object-server on ms-be1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[14:36:50] <icinga-wm>	 RECOVERY - puppet last run on elastic1030 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[14:37:00] <icinga-wm>	 PROBLEM - puppet last run on ms-be1031 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 5 minutes ago with 5 failures. Failed resources (up to 3 shown): Service[swift-account-replicator],Service[swift-account-reaper],Service[swift-account-auditor],Service[swift-object]
[14:37:10] <godog>	 still me ^ silencing
[14:37:19] <wikibugs>	 (03CR) 10Elukey: "Completely right, the Gerrit preview fooled me! Thanks for the explanation, will test your code today/tomorrow!" [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/318053 (owner: 10R4q3NWnUx2CEhVyr)
[14:37:30] <icinga-wm>	 PROBLEM - swift-account-auditor on ms-be1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[14:37:40] <icinga-wm>	 PROBLEM - swift-account-reaper on ms-be1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[14:38:00] <icinga-wm>	 PROBLEM - swift-account-replicator on ms-be1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[14:38:29] <godog>	 !log run stress test (w/ bonnie) on new swift hw - T160640
[14:38:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:36] <stashbot>	 T160640: Rack and Setup ms-be1028-ms-1039 - https://phabricator.wikimedia.org/T160640
[14:39:11] <wikibugs__>	 (03CR) 10Gehel: [C: 032] postgresql - simplify creation of databases [puppet] - 10https://gerrit.wikimedia.org/r/345573 (https://phabricator.wikimedia.org/T157613) (owner: 10Gehel)
[14:39:18] <wikibugs>	 (03PS3) 10Gehel: postgresql - simplify creation of databases [puppet] - 10https://gerrit.wikimedia.org/r/345573 (https://phabricator.wikimedia.org/T157613)
[14:40:41] <wikibugs__>	 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3143658 (10Papaul) 1st crash  Date: October 24, 2016  Troubleshooting : removed both PSU's for a couple of minutes   2nd crash  Date December 12...
[14:41:06] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 0.1.37 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/345283 (https://phabricator.wikimedia.org/T151553) (owner: 10Gilles)
[14:43:40] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: x1 on db2033 is OK: OK slave_sql_lag Replication lag: 0.07 seconds
[14:45:50] <icinga-wm>	 PROBLEM - puppet last run on maps2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:46:50] <icinga-wm>	 PROBLEM - puppet last run on wtp1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:48:11] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 314
[14:48:23] <wikibugs__>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Remove db1057 from s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345535 (https://phabricator.wikimedia.org/T160435) (owner: 10Marostegui)
[14:48:50] <icinga-wm>	 PROBLEM - puppet last run on maps1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:49:32] <wikibugs__>	 (03Merged) 10jenkins-bot: db-eqiad.php: Remove db1057 from s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345535 (https://phabricator.wikimedia.org/T160435) (owner: 10Marostegui)
[14:49:41] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Remove db1057 from s1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345535 (https://phabricator.wikimedia.org/T160435) (owner: 10Marostegui)
[14:49:46] <gehel>	 puppet fail on maps* is me, fix coming up
[14:50:57] <wikibugs__>	 (03PS1) 10Gehel: postgresql - package management has moved to postgresql::server class [puppet] - 10https://gerrit.wikimedia.org/r/345580
[14:50:59] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove db1057 entry from s1 shard - T160435 (duration: 00m 44s)
[14:51:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:05] <stashbot>	 T160435: db1057 does not react to powercycle/powerdown/powerup commands - https://phabricator.wikimedia.org/T160435
[14:54:16] <wikibugs>	 (03CR) 10Gehel: [C: 032] postgresql - package management has moved to postgresql::server class [puppet] - 10https://gerrit.wikimedia.org/r/345580 (owner: 10Gehel)
[14:54:40] <icinga-wm>	 PROBLEM - puppet last run on maps-test2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:54:51] <wikibugs__>	 (03PS1) 10Faidon Liambotis: motd: remove precise, add comments for stretch [puppet] - 10https://gerrit.wikimedia.org/r/345581
[14:55:40] <icinga-wm>	 PROBLEM - puppet last run on maps2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:55:50] <icinga-wm>	 RECOVERY - puppet last run on maps2002 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[14:59:59] <wikibugs__>	 (03CR) 10Hashar: "Standardization!! I found a few more using the perl regex:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/345561 (owner: 10Faidon Liambotis)
[15:00:50] <icinga-wm>	 RECOVERY - swift-object-replicator on ms-be1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[15:01:00] <icinga-wm>	 RECOVERY - swift-account-replicator on ms-be1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[15:01:00] <icinga-wm>	 RECOVERY - puppet last run on ms-be1031 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[15:01:10] <icinga-wm>	 RECOVERY - swift-object-server on ms-be1031 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[15:01:30] <icinga-wm>	 RECOVERY - swift-account-auditor on ms-be1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[15:01:40] <icinga-wm>	 RECOVERY - swift-account-reaper on ms-be1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[15:05:33] <wikibugs>	 (03PS2) 10Faidon Liambotis: Standardize on lowercase os_version/require_os [puppet] - 10https://gerrit.wikimedia.org/r/345561
[15:06:11] <wikibugs__>	 (03CR) 10Faidon Liambotis: "Thanks Antoine, a couple of more fixed. There are more that surface with that grep, but they are being fixed by the precise-remnants topic" [puppet] - 10https://gerrit.wikimedia.org/r/345561 (owner: 10Faidon Liambotis)
[15:12:00] <wikibugs>	 (03Draft1) 10Paladox: Gerrit: Add log4j.logger.org.apache.sshd.common.keyprovider.FileKeyPairProvider=INFO to log4j [puppet] - 10https://gerrit.wikimedia.org/r/345583
[15:12:03] <wikibugs__>	 (03PS2) 10Paladox: Gerrit: Add log4j.logger.org.apache.sshd.common.keyprovider.FileKeyPairProvider=INFO to log4j [puppet] - 10https://gerrit.wikimedia.org/r/345583
[15:13:30] <icinga-wm>	 PROBLEM - puppet last run on wdqs1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:14:29] <wikibugs>	 06Operations, 06Analytics-Kanban, 10DBA: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#3143721 (10Nuria) a:03Ottomata
[15:15:28] <wikibugs>	 06Operations, 06Analytics-Kanban, 10DBA: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#1952524 (10Nuria) Let's take advantage of the fact that after the rename we have now autoincrement ids on new tables .
[15:15:50] <icinga-wm>	 RECOVERY - puppet last run on wtp1001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[15:16:00] <icinga-wm>	 PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:16:10] <wikibugs__>	 (03CR) 10Faidon Liambotis: [C: 032] motd: remove precise, add comments for stretch [puppet] - 10https://gerrit.wikimedia.org/r/345581 (owner: 10Faidon Liambotis)
[15:16:29] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: role::mediawiki::common: convert to profiles (step 2) [puppet] - 10https://gerrit.wikimedia.org/r/345529
[15:17:50] <icinga-wm>	 RECOVERY - puppet last run on maps1004 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[15:18:28] <wikibugs__>	 (03CR) 10Giuseppe Lavagetto: [C: 032] role::mediawiki::common: convert to profiles (step 2) [puppet] - 10https://gerrit.wikimedia.org/r/345529 (owner: 10Giuseppe Lavagetto)
[15:18:45] <wikibugs>	 (03CR) 10Ottomata: [C: 031] Prepare analytics1027 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/345117 (https://phabricator.wikimedia.org/T161597) (owner: 10Elukey)
[15:19:16] <_joe_>	 paravoid: can I merge your change?
[15:19:51] <paravoid>	 yes please
[15:21:36] <paravoid>	 _joe_: ^
[15:21:46] <_joe_>	 paravoid: already done
[15:21:56] <paravoid>	 do we have a job these days to trigger the puppet compiler and report back if a change is noop?
[15:22:02] <_joe_>	 sorry, I was concentrated on monitoring my change
[15:22:18] <_joe_>	 paravoid: nope
[15:22:24] <paravoid>	 :(
[15:22:34] <_joe_>	 paravoid: it's pretty easy to write using cumin libraries to query puppetdb, too
[15:22:40] <icinga-wm>	 RECOVERY - puppet last run on maps-test2003 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[15:22:49] <_joe_>	 to create a host list, I mean
[15:23:00] <bblack>	 a list of affected hosts you mean?
[15:23:03] <_joe_>	 yes
[15:23:36] <volans>	 paravoid: also if the change is only in installed packages it might not be reported by the puppet compiler
[15:23:57] <_joe_>	 so the ideal CI job would: extract name of changed classes or defines from the changed files
[15:24:00] <paravoid>	 I was thinking of stuff like https://gerrit.wikimedia.org/r/345561
[15:24:35] <paravoid>	 brb
[15:24:40] <icinga-wm>	 RECOVERY - puppet last run on maps2003 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[15:25:02] <volans>	 that probably is worth a puppet compiler run without specifying the hosts, will take time though
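
Note: the first step _joe_ describes (pulling the names of changed classes or defines out of the changed files) could be sketched roughly as below. This is only an illustration under stated assumptions, not an existing CI job: the function names and the path-to-text input mapping are hypothetical, the regex only covers simple `class name` / `define name` declarations, and the later steps (querying puppetdb for affected hosts, running the puppet compiler) are left out.

```python
import re

# Matches Puppet declarations such as "class foo::bar (" or "define foo::baz {".
# Resource-style usage like "class { 'foo': }" is deliberately not matched.
DECL_RE = re.compile(
    r"^\s*(class|define)\s+([a-z][a-z0-9_]*(?:::[a-z][a-z0-9_]*)*)",
    re.MULTILINE,
)


def declared_names(manifest_text):
    """Return (kind, name) pairs for classes and defines declared in a manifest."""
    return [(m.group(1), m.group(2)) for m in DECL_RE.finditer(manifest_text)]


def names_from_changed_files(changed):
    """Collect declarations from changed .pp files.

    `changed` is a hypothetical mapping of file path -> file contents,
    as a CI job might assemble it from the change under review.
    """
    found = set()
    for path, text in changed.items():
        if path.endswith(".pp"):  # only Puppet manifests are inspected
            found.update(declared_names(text))
    return sorted(found)
```

The resulting names would then be the input to whatever host lookup is used; that part depends on the puppetdb setup and is not shown here.
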
[15:27:04] <icinga-wm>	 PROBLEM - nova-conductor process on labcontrol1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-conductor
[15:27:10] <icinga-wm>	 PROBLEM - DPKG on labcontrol1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[15:27:20] <icinga-wm>	 PROBLEM - nova-scheduler process on labcontrol1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-scheduler
[15:27:29] <paladox>	 chasemp ^^
[15:27:34] <bblack>	 that's paging too
[15:27:53] <chasemp>	 andrewbogott: moritzm are you guys doing labvirt maint still?
[15:28:04] <andrewbogott>	 yeah, those alerts are me
[15:28:10] <icinga-wm>	 RECOVERY - nova-conductor process on labcontrol1001 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/nova-conductor
[15:28:25] <icinga-wm>	 PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[15:28:25] <icinga-wm>	 RECOVERY - nova-scheduler process on labcontrol1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-scheduler
[15:28:31] <andrewbogott>	 should be fixed
[15:28:33] <chasemp>	 bblack: sorry, upgrades in progress I believe and forgot to silence
[15:28:59] <andrewbogott>	 Yeah, it shouldn't have stopped any services, something busted in the .deb I guess :(
[15:29:15] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner: move role to profile, unify videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/345530
[15:29:15] <icinga-wm>	 RECOVERY - DPKG on labcontrol1001 is OK: All packages OK
[15:29:25] <icinga-wm>	 RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[15:29:28] <andrewbogott>	 the same process worked without alerts on labtest
[15:30:10] <_joe_>	 andrewbogott: or, when doing it on production and not test you have traffic, load, etc interfering
[15:30:13] <_joe_>	 :)
[15:30:59] <andrewbogott>	 It was apt-get fussing about a config file — it removed the file and then got upset that it wasn't there.  I touched it and then all was well.
[15:32:35] <wikibugs>	 (03PS1) 10Tobias Gritschacher: Add an alternative ssh-key for goransm [puppet] - 10https://gerrit.wikimedia.org/r/345587
[15:33:45] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.192.16.188:7001 on restbase2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[15:33:55] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.192.16.188:9042 on restbase2010 is CRITICAL: connect to address 10.192.16.188 and port 9042: Connection refused
[15:34:25] <icinga-wm>	 PROBLEM - Check systemd state on restbase2010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[15:35:25] <icinga-wm>	 RECOVERY - Check systemd state on restbase2010 is OK: OK - running: The system is fully operational
[15:35:45] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.192.16.188:7001 on restbase2010 is OK: SSL OK - Certificate restbase2010-c valid until 2017-11-17 00:54:27 +0000 (expires in 231 days)
[15:35:55] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.192.16.188:9042 on restbase2010 is OK: TCP OK - 0.036 second response time on 10.192.16.188 port 9042
[15:38:08] <wikibugs>	 (03CR) 10GoranSMilovanovic: [C: 031] Add an alternative ssh-key for goransm [puppet] - 10https://gerrit.wikimedia.org/r/345587 (owner: 10Tobias Gritschacher)
[15:39:18] <wikibugs__>	 (03PS1) 10Andrew Bogott: Include nova's policy.json along with all services. [puppet] - 10https://gerrit.wikimedia.org/r/345588
[15:39:27] <wikibugs>	 (03CR) 10Elukey: "Other comments if you have time :)" (035 comments) [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/318053 (owner: 10R4q3NWnUx2CEhVyr)
[15:40:21] <wikibugs__>	 (03CR) 10jerkins-bot: [V: 04-1] Include nova's policy.json along with all services. [puppet] - 10https://gerrit.wikimedia.org/r/345588 (owner: 10Andrew Bogott)
[15:41:35] <icinga-wm>	 RECOVERY - puppet last run on wdqs1003 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[15:41:54] <elukey>	 urandom: restbase2010 ?
[15:42:11] <wikibugs__>	 (03PS2) 10BBlack: cxserver: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/345542
[15:42:24] <urandom>	 elukey: looking
[15:42:32] <wikibugs>	 (03PS2) 10BBlack: citoid: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/345543
[15:42:45] <urandom>	 elukey: probably an oom :o
[15:42:51] <wikibugs__>	 (03PS1) 10BBlack: maps: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/345591
[15:43:08] <wikibugs>	 (03PS1) 10BBlack: swift/upload: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/345592
[15:44:05] <elukey>	 subbu: yessss
[15:44:11] <elukey>	 errr urandom 
[15:44:17] <elukey>	 sorry :)
[15:44:41] <urandom>	 elukey: yeah, two in a row; so if it keeps doing it we can try dialing down the tombstone threshold until it passes
[15:45:05] <icinga-wm>	 RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[15:45:55] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.192.16.188:9042 on restbase2010 is CRITICAL: connect to address 10.192.16.188 and port 9042: Connection refused
[15:46:05] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:46:45] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 48652.559023 Seconds
[15:46:55] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.192.16.188:9042 on restbase2010 is OK: TCP OK - 0.036 second response time on 10.192.16.188 port 9042
[15:47:05] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy
[15:47:15] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is CRITICAL: connect to address 10.192.32.137 and port 9042: Connection refused
[15:47:35] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 48709.626435 Seconds
[15:47:36] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 48709.629132 Seconds
[15:47:45] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.32.137:7001 on restbase2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[15:47:58] <urandom>	 ugh
[15:48:45] <icinga-wm>	 PROBLEM - cassandra-a service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[15:48:45] <icinga-wm>	 PROBLEM - Check systemd state on restbase2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[15:49:45] <icinga-wm>	 RECOVERY - cassandra-a service on restbase2004 is OK: OK - cassandra-a is active
[15:49:45] <icinga-wm>	 RECOVERY - Check systemd state on restbase2004 is OK: OK - running: The system is fully operational
[15:49:45] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds
[15:50:35] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds
[15:50:35] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds
[15:50:45] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.192.32.137:7001 on restbase2004 is OK: SSL OK - Certificate restbase2004-a valid until 2017-09-12 15:35:23 +0000 (expires in 165 days)
[15:51:15] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is OK: TCP OK - 0.037 second response time on 10.192.32.137 port 9042
[15:51:18] <volans>	 urandom: need help?
[15:51:31] <urandom>	 volans: with this, or in general? :)
[15:51:51] <urandom>	 volans: this is a known issue
[15:51:52] <volans>	 lol :) with restbase complainin...
[15:52:11] <urandom>	 and it's happening in codfw which means it won't impact client reads
[15:52:14] <urandom>	 small favors
[15:52:47] <urandom>	 some update is tripping over some aberrant data
[15:53:20] <volans>	 ok, although not so ok :)
[15:53:45] <icinga-wm>	 PROBLEM - cassandra-c service on restbase2010 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[15:53:45] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.192.16.188:7001 on restbase2010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[15:53:55] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.192.16.188:9042 on restbase2010 is CRITICAL: connect to address 10.192.16.188 and port 9042: Connection refused
[15:54:25] <icinga-wm>	 PROBLEM - Check systemd state on restbase2010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[15:54:40] <wikibugs__>	 06Operations, 10Analytics, 07Documentation: Improve SSH access information in onboarding documentation - https://phabricator.wikimedia.org/T160941#3115695 (10Nuria) Can we be specific as to what needs improvements to help ops document what is needed?  cc @Zareenf, @Tbayer @mpopov who had had trouble with thi...
[15:55:45] <icinga-wm>	 RECOVERY - cassandra-c service on restbase2010 is OK: OK - cassandra-c is active
[15:56:25] <icinga-wm>	 RECOVERY - Check systemd state on restbase2010 is OK: OK - running: The system is fully operational
[15:56:45] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.192.16.188:7001 on restbase2010 is OK: SSL OK - Certificate restbase2010-c valid until 2017-11-17 00:54:27 +0000 (expires in 231 days)
[15:56:55] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.192.16.188:9042 on restbase2010 is OK: TCP OK - 0.038 second response time on 10.192.16.188 port 9042
[15:56:56] <wikibugs__>	 06Operations, 10ops-eqiad, 06Analytics-Kanban, 06DC-Ops, 13Patch-For-Review: Decom/Reclaim analytics1027 - https://phabricator.wikimedia.org/T161597#3143852 (10Nuria)
[15:57:17] <wikibugs>	 06Operations, 10ops-eqiad, 06Analytics-Kanban, 06DC-Ops, 13Patch-For-Review: Decom/Reclaim analytics1027 - https://phabricator.wikimedia.org/T161597#3136552 (10Nuria) a:03elukey
[15:58:40] <wikibugs>	 (03PS1) 10Gilles: Follow-up fixes for 0.1.37 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/345597
[15:58:50] <wikibugs__>	 06Operations, 10ops-eqiad, 13Patch-For-Review, 15User-fgiunchedi: Rack and Setup ms-be1028-ms-1039 - https://phabricator.wikimedia.org/T160640#3143859 (10fgiunchedi) a:05Cmjohnson>03fgiunchedi
[15:59:15] <logmsgbot>	 !log twentyafterfour@tin Synchronized php-1.29.0-wmf.17/includes/: sync I7c5c0a9b1af99ce2b5f4bdcc99710d8400ca8bcf refs  T159319 (duration: 01m 41s)
[15:59:19] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] Follow-up fixes for 0.1.37 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/345597 (owner: 10Gilles)
[15:59:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:04] <jouncebot>	 godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170330T1600).
[16:00:14] <wikibugs__>	 (03PS1) 10Gilles: Remove trailing slash from SWIFT_API_PATH in Thumbor config [puppet] - 10https://gerrit.wikimedia.org/r/345598
[16:01:12] <wikibugs>	 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: CODFW: 6 Nodes for Kafka refresh/upgrade - https://phabricator.wikimedia.org/T161637#3138006 (10Nuria) p:05Triage>03Normal
[16:01:46] <icinga-wm>	 PROBLEM - Apache HTTP on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:02:05] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:02:05] <icinga-wm>	 PROBLEM - HHVM rendering on mw1191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:02:21] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: profile::mediawiki::jobrunner: move role to profile, unify videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/345530
[16:03:09] <_joe_>	 !log restarting hhvm on mw1191, stuck in HPHP::Treadmill::getAgeOldestRequest
[16:03:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:55] <wikibugs__>	 (03CR) 10Giuseppe Lavagetto: [C: 032] profile::mediawiki::jobrunner: move role to profile, unify videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/345530 (owner: 10Giuseppe Lavagetto)
[16:06:55] <icinga-wm>	 PROBLEM - Check systemd state on thumbor1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:07:07] <wikibugs__>	 (03PS2) 10Filippo Giunchedi: Remove trailing slash from SWIFT_API_PATH in Thumbor config [puppet] - 10https://gerrit.wikimedia.org/r/345598 (owner: 10Gilles)
[16:07:55] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1001 is OK: OK - running: The system is fully operational
[16:11:06] <wikibugs__>	 (03CR) 10Filippo Giunchedi: [C: 032] Remove trailing slash from SWIFT_API_PATH in Thumbor config [puppet] - 10https://gerrit.wikimedia.org/r/345598 (owner: 10Gilles)
[16:17:21] <godog>	 !log upgrade thumbor to 0.1.37 on thumbor100[12]
[16:17:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:39] <wikibugs__>	 (03PS2) 10Filippo Giunchedi: rancid: create 'configs' directory [puppet] - 10https://gerrit.wikimedia.org/r/345369
[16:19:15] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 032 C: 032] rancid: create 'configs' directory [puppet] - 10https://gerrit.wikimedia.org/r/345369 (owner: 10Filippo Giunchedi)
[16:20:06] <wikibugs__>	 06Operations, 05Goal: reduce amount of remaining Ubuntu 12.04 (precise) systems in production - https://phabricator.wikimedia.org/T123525#3143911 (10fgiunchedi)
[16:20:08] <wikibugs>	 06Operations, 13Patch-For-Review, 15User-fgiunchedi: upgrade netmon1001 to jessie - https://phabricator.wikimedia.org/T125020#3143909 (10fgiunchedi) 05Open>03Resolved This is completed!
[16:20:32] <wikibugs>	 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: Add request URL to thumbor errors - https://phabricator.wikimedia.org/T151553#3143912 (10Gilles) 05Open>03Resolved I'll actually enable them when there's ELK integration. Right now added to the existing log entries, it would be too verbose.
[16:28:57] <wikibugs>	 (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/345588 (owner: 10Andrew Bogott)
[16:30:08] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Include nova's policy.json along with all services. [puppet] - 10https://gerrit.wikimedia.org/r/345588 (owner: 10Andrew Bogott)
[16:31:50] <wikibugs>	 06Operations, 10hardware-requests: codfw: (1) netmon system - https://phabricator.wikimedia.org/T161807#3143960 (10fgiunchedi)
[16:33:19] <icinga-wm>	 PROBLEM - puppet last run on logstash1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:33:39] <icinga-wm>	 PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py]
[16:34:00] <wikibugs>	 (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345603 (https://phabricator.wikimedia.org/T128546)
[16:34:16] <wikibugs>	 (03PS2) 10Gilles: Make Thumbor connect to Swift via https [puppet] - 10https://gerrit.wikimedia.org/r/343263 (https://phabricator.wikimedia.org/T160670)
[16:37:29] <wikibugs>	 06Operations, 10Analytics, 07Documentation: Improve SSH access information in onboarding documentation - https://phabricator.wikimedia.org/T160941#3144005 (10Tbayer) >>! In T160941#3143836, @Nuria wrote: > Can we be specific as to what needs improvements to help ops document what is needed?  cc @Zareenf, @Tb...
[16:37:36] <wikibugs__>	 (03CR) 10Filippo Giunchedi: [C: 032] Make Thumbor connect to Swift via https [puppet] - 10https://gerrit.wikimedia.org/r/343263 (https://phabricator.wikimedia.org/T160670) (owner: 10Gilles)
[16:39:02] <wikibugs>	 06Operations, 10Analytics-EventLogging, 06Analytics-Kanban, 10DBA: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#3144009 (10Nuria)
[16:39:26] <wikibugs>	 (03PS2) 10Gehel: postgresql - fix tests [puppet] - 10https://gerrit.wikimedia.org/r/345338
[16:39:39] <wikibugs__>	 (03PS2) 10Filippo Giunchedi: Use proper proxy_next_upstream configuration for Thumbor's nginx [puppet] - 10https://gerrit.wikimedia.org/r/345315 (https://phabricator.wikimedia.org/T161613) (owner: 10Gilles)
[16:40:24] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "on hold until we check things with the maps team" [puppet] - 10https://gerrit.wikimedia.org/r/345591 (owner: 10BBlack)
[16:41:11] <wikibugs__>	 (03PS2) 10Andrew Bogott: Include nova's policy.json along with all services. [puppet] - 10https://gerrit.wikimedia.org/r/345588
[16:41:32] <wikibugs>	 (03PS3) 10Gehel: postgresql - fix tests [puppet] - 10https://gerrit.wikimedia.org/r/345338
[16:41:51] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] Use proper proxy_next_upstream configuration for Thumbor's nginx [puppet] - 10https://gerrit.wikimedia.org/r/345315 (https://phabricator.wikimedia.org/T161613) (owner: 10Gilles)
[16:43:12] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Include nova's policy.json along with all services. [puppet] - 10https://gerrit.wikimedia.org/r/345588 (owner: 10Andrew Bogott)
[16:43:19] <wikibugs__>	 (03PS3) 10Andrew Bogott: Include nova's policy.json along with all services. [puppet] - 10https://gerrit.wikimedia.org/r/345588
[16:45:24] <wikibugs>	 (03PS1) 10Gilles: Follow-up fix for hasattr on early error (0.1.37) [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/345608
[16:45:59] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Include nova's policy.json along with all services. [puppet] - 10https://gerrit.wikimedia.org/r/345588 (owner: 10Andrew Bogott)
[16:46:19] <wikibugs__>	 (03PS4) 10Andrew Bogott: Include nova's policy.json along with all services. [puppet] - 10https://gerrit.wikimedia.org/r/345588
[16:47:30] <wikibugs>	 (03PS4) 10Gehel: postgresql - fix tests [puppet] - 10https://gerrit.wikimedia.org/r/345338
[16:47:34] <wikibugs__>	 06Operations, 10Analytics, 07Documentation: Improve SSH access information in onboarding documentation - https://phabricator.wikimedia.org/T160941#3144130 (10Nuria) This task will be seen by the ops on cleaning duty next week. It will help to have a list of issues so they can know what problems the documenta...
[16:48:32] <wikibugs__>	 (03CR) 10Andrew Bogott: [C: 032] Include nova's policy.json along with all services. [puppet] - 10https://gerrit.wikimedia.org/r/345588 (owner: 10Andrew Bogott)
[16:49:20] <wikibugs__>	 (03PS5) 10Gehel: postgresql - fix tests [puppet] - 10https://gerrit.wikimedia.org/r/345338
[16:50:02] <wikibugs>	 06Operations, 07Documentation: Improve SSH access information in onboarding documentation - https://phabricator.wikimedia.org/T160941#3144138 (10Nuria)
[16:50:58] <wikibugs__>	 (03CR) 10Gehel: [C: 032] postgresql - fix tests [puppet] - 10https://gerrit.wikimedia.org/r/345338 (owner: 10Gehel)
[16:51:29] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down!
[16:51:49] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:52:24] <gilles>	 godog: expected?
[16:52:29] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[16:53:40] <icinga-wm>	 RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[16:53:59] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down!
[16:54:08] <godog>	 mhh no gilles, checking
[16:54:31] <gilles>	 the retries working properly now probably increases the workload
[16:54:39] <wikibugs>	 (03Abandoned) 10Rush: WIP: labstore: nfs-mounts.yaml per role and nfs-manage-mounts adjust [puppet] - 10https://gerrit.wikimedia.org/r/345168 (https://phabricator.wikimedia.org/T158883) (owner: 10Rush)
[16:55:29] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1001.eqiad.wmnet because of too many down!
[16:59:18] <wikibugs__>	 06Operations, 10Traffic: Investigate 502 errors from nginx, when backend returns 302 - https://phabricator.wikimedia.org/T161819#3144203 (10EBernhardson)
[17:00:04] <jouncebot>	 gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170330T1700).
[17:00:37] <wikibugs>	 06Operations, 10Traffic: Investigate 502 errors from nginx, when backend returns 302 - https://phabricator.wikimedia.org/T161819#3144218 (10EBernhardson)
[17:00:46] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 210 bytes in 0.883 second response time
[17:00:54] <subbu>	 no parsoid deploy today
[17:01:06] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy
[17:01:06] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[17:01:16] <icinga-wm>	 RECOVERY - puppet last run on logstash1003 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[17:01:59] <wikibugs>	 06Operations, 10Traffic: Investigate 502 errors from nginx, when backend returns 302 - https://phabricator.wikimedia.org/T161819#3144203 (10EBernhardson)
[17:03:11] <wikibugs>	 06Operations, 10Traffic, 10Wikimedia-Logstash: Investigate 502 errors from nginx when backend returns 302 - https://phabricator.wikimedia.org/T161819#3144248 (10EBernhardson)
[17:04:06] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1001.eqiad.wmnet because of too many down!
[17:05:06] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[17:05:56] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:06:28] <godog>	 still looking into thumbor/nginx, will downtime that ^
[17:07:06] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 41529
[17:08:06] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1001.eqiad.wmnet because of too many down!
[17:09:06] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[17:10:41] <wikibugs>	 (03PS10) 10Gehel: maps - cleartables osm replication [puppet] - 10https://gerrit.wikimedia.org/r/341563 (https://phabricator.wikimedia.org/T157613)
[17:10:55] <wikibugs__>	 (03CR) 10Gehel: "rebased" [puppet] - 10https://gerrit.wikimedia.org/r/341563 (https://phabricator.wikimedia.org/T157613) (owner: 10Gehel)
[17:13:56] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1001.eqiad.wmnet because of too many down!
[17:15:46] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 210 bytes in 0.002 second response time
[17:15:57] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy
[17:18:57] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1001.eqiad.wmnet because of too many down!
[17:19:36] <icinga-wm>	 PROBLEM - Apache HTTP on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:19:46] <icinga-wm>	 PROBLEM - HHVM rendering on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:19:56] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:20:26] <icinga-wm>	 PROBLEM - Check systemd state on thumbor1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[17:20:56] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy
[17:21:51] <wikibugs>	 (03PS1) 10Gilles: Revert "Make Thumbor connect to Swift via https" [puppet] - 10https://gerrit.wikimedia.org/r/345615
[17:22:19] <wikibugs__>	 (03PS2) 10Filippo Giunchedi: Revert "Make Thumbor connect to Swift via https" [puppet] - 10https://gerrit.wikimedia.org/r/345615 (owner: 10Gilles)
[17:22:26] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Revert "Make Thumbor connect to Swift via https" [puppet] - 10https://gerrit.wikimedia.org/r/345615 (owner: 10Gilles)
[17:24:26] <icinga-wm>	 RECOVERY - Check systemd state on thumbor1002 is OK: OK - running: The system is fully operational
[17:24:30] <wikibugs>	 06Operations, 06Commons, 10Traffic, 10Wikimedia-Site-requests, and 2 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517#3144389 (10matmarex) I think that's a different uselang hack.
[17:28:38] <elukey>	 cmjohnson1: hi!
[17:29:06] <cmjohnson1>	 hi 
[17:29:21] <elukey>	 was it today that we scheduled the test for thermal paste?
[17:29:37] <cmjohnson1>	 yes it was....i forgot to put on my cal...do you have time now?
[17:29:45] <elukey>	 sure if you have it!
[17:29:50] <cmjohnson1>	 yes
[17:29:54] <elukey>	 I'll turn off analytics1039
[17:32:34] <wikibugs__>	 06Operations, 06Commons, 10Traffic, 10Wikimedia-Site-requests, and 2 others: Allow anonymous users to change interface language on Commons with ULS - https://phabricator.wikimedia.org/T161517#3144415 (10Steinsplitter) >>! In T161517#3144389, @matmarex wrote: > I think that's a different uselang hack.  Yepp...
[17:32:55] <elukey>	 !log shutdown analytics1039 to apply new thermal paste - T132256
[17:33:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:01] <stashbot>	 T132256: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256
[17:33:34] <elukey>	 cmjohnson1: it should drain in 5 mins, will ping you when ready!
[17:38:08] <wikibugs__>	 06Operations, 07HHVM: HHVM 3.18 deadlocks after 4-6 hours (stuck in in HPHP::Treadmill::getAgeOldestRequest() ) - https://phabricator.wikimedia.org/T161684#3144442 (10MoritzMuehlenhoff) I suspected this might crash due to our setting of kernel.perf_event_paranoid=3 and some kind of resource leak due to failing...
[17:41:36] <icinga-wm>	 PROBLEM - Check systemd state on mw1261 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[17:41:56] <icinga-wm>	 PROBLEM - HHVM processes on mw1261 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm
[17:43:26] <icinga-wm>	 RECOVERY - Apache HTTP on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.092 second response time
[17:43:36] <icinga-wm>	 RECOVERY - Check systemd state on mw1261 is OK: OK - running: The system is fully operational
[17:43:36] <icinga-wm>	 RECOVERY - HHVM rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 73923 bytes in 4.070 second response time
[17:43:46] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.242 second response time
[17:43:47] <wikibugs>	 (03PS1) 1020after4: Phab: User base:service_unit for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/345617 (https://phabricator.wikimedia.org/T112765)
[17:43:56] <icinga-wm>	 RECOVERY - HHVM processes on mw1261 is OK: PROCS OK: 6 processes with command name hhvm
[17:45:02] <elukey>	 cmjohnson1: analytics1039 just shut down :)
[17:45:09] <cmjohnson1>	 yep
[17:45:42] <wikibugs__>	 (03PS1) 1020after4: Phab: create some task types and corresponding custom fields. [puppet] - 10https://gerrit.wikimedia.org/r/345618
[17:50:54] <wikibugs>	 (03PS2) 10Dduvall: ci: Docker registry for container builds [puppet] - 10https://gerrit.wikimedia.org/r/345422 (https://phabricator.wikimedia.org/T161657)
[17:56:36] <icinga-wm>	 RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 297699
[17:57:00] <cmjohnson1>	 elukey: powering up now 
[17:57:23] <elukey>	 nice!
[17:58:07] <elukey>	 I'll monitor the mcelog during the next days
[17:58:58] <cmjohnson1>	 sounds good
[18:00:04] <jouncebot>	 addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170330T1800). Please do the needful.
[18:00:04] <jouncebot>	 jan_drewniak: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[18:00:14] <wikibugs__>	 (03PS1) 10Ladsgroup: Enable ORES review tool in hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345619 (https://phabricator.wikimedia.org/T161621)
[18:00:48] <elukey>	 cmjohnson1: thanks a lot! 
[18:01:05] <cmjohnson1>	 yw...sorry i forgot this morning
[18:01:39] <elukey>	 no problem :)
[18:01:55] <Amir1>	 If there is time, I added one thing to swat
[18:02:33] <MaxSem>	 guess I can swat
[18:03:16] <wikibugs>	 06Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 15User-Elukey: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#3144546 (10elukey) Chris applied the thermal paste and the host is up and running again. Will watch mcelog during the next days to see if...
[18:05:13] <MaxSem>	 jan_drewniak, your instructions are wrong. sync-portals should be the thing to deploy with, not run after deploy
[18:06:16] <wikibugs__>	 (03CR) 10MaxSem: [C: 032] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345603 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[18:07:29] <wikibugs__>	 (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345603 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[18:07:38] <wikibugs__>	 (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345603 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[18:08:56] <icinga-wm>	 PROBLEM - puppet last run on db1075 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:09:32] <MaxSem>	 jan_drewniak, pulled on mwdebug1002
[18:12:26] <icinga-wm>	 PROBLEM - puppet last run on mc1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:13:48] <MaxSem>	 jan_drewniak, how does it look?
[18:14:54] <jan_drewniak>	 MaxSem: one sec
[18:15:53] <jan_drewniak>	 MaxSem: yup,  looks good
[18:16:58] <wikibugs>	 (03PS2) 1020after4: Phab: Use base:service_unit for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/345617 (https://phabricator.wikimedia.org/T112765)
[18:17:34] <wikibugs__>	 (03PS3) 1020after4: Phab: Use base:service_unit for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/345617 (https://phabricator.wikimedia.org/T112765)
[18:17:38] <logmsgbot>	 !log maxsem@tin Synchronized portals/prod/wikipedia.org/assets: (no justification provided) (duration: 00m 45s)
[18:17:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:22] <logmsgbot>	 !log maxsem@tin Synchronized portals: (no justification provided) (duration: 00m 44s)
[18:18:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:39] <MaxSem>	 jan_drewniak, deployed, please test
[18:18:52] <wikibugs>	 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3144577 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by otto on neodymium.eqiad.wmnet for hosts: ``` ['analytics1046.eqiad.wmnet'] ``` The log can b...
[18:21:40] <jan_drewniak>	 MaxSem: mwdebug1002 and prod are still different... last time we had this issue it was because the date-modified timestamp on the portals repo didn't change after it was updated (and thus the files weren't copied over)
[18:22:28] <jan_drewniak>	 the solution was to `touch portals` and rerun the script
[18:23:44] <logmsgbot>	 !log maxsem@tin Synchronized portals/prod/wikipedia.org/assets: (no justification provided) (duration: 00m 44s)
[18:23:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:28] <godog>	 !log swift eqiad-prod add ms-be1028 -> ms-be1039 - T160640
[18:24:28] <logmsgbot>	 !log maxsem@tin Synchronized portals: (no justification provided) (duration: 00m 44s)
[18:24:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:34] <stashbot>	 T160640: Rack and Setup ms-be1028-ms-1039 - https://phabricator.wikimedia.org/T160640
[18:24:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:40] <MaxSem>	 jan_drewniak, ^
[18:27:05] <jan_drewniak>	 MaxSem: oy,  still different...
[18:28:21] <logmsgbot>	 !log maxsem@tin Synchronized portals/prod/wikipedia.org/assets: (no justification provided) (duration: 00m 44s)
[18:28:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:29:07] <logmsgbot>	 !log maxsem@tin Synchronized portals: (no justification provided) (duration: 00m 45s)
[18:29:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:29:38] <MaxSem>	 jan_drewniak, ^
[18:32:48] <jan_drewniak>	 MaxSem: can you manually compare the files on staging and production? I'm still seeing the old version in production :/ 
[18:33:43] <Amir1>	 maybe it's varnish cache?
[18:34:10] <Amir1>	 (a wild idea, I have no idea how portals work)
[18:34:16] <icinga-wm>	 PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0]
[18:35:05] <jan_drewniak>	 Amir1: I have no idea how varnish works :P
[18:35:33] <MaxSem>	 fuck, appservers have older files...
[18:35:45] <Amir1>	 https://wikitech.wikimedia.org/wiki/Varnish
[18:35:51] <Amir1>	 This might help :)
[18:36:34] <MaxSem>	 despite even everything having been touched
[18:36:57] <icinga-wm>	 RECOVERY - puppet last run on db1075 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[18:38:06] <logmsgbot>	 !log maxsem@tin Synchronized portals: (no justification provided) (duration: 00m 44s)
[18:38:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:38:53] <MaxSem>	 anyone knows what's going on with scap?
[18:39:02] <MaxSem>	 bd808, RainbowSprinkles ^?
[18:40:26] <icinga-wm>	 RECOVERY - puppet last run on mc1031 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[18:40:49] <RainbowSprinkles>	 MaxSem: I've been on vacation for the past week, so no
[18:41:06] <MaxSem>	 woot, congrats on vacation!
[18:43:23] <jan_drewniak>	 MaxSem: sync-dir was recently changed to sync-file in the script, but it seemed like that should improve things https://github.com/wikimedia/portals/commit/ad6eaef5090ef7f706989123a5fcfe6ea22fcb4c
[18:44:43] <MaxSem>	 trying with sync-dir
[18:45:06] <logmsgbot>	 !log maxsem@tin Synchronized portals/prod/wikipedia.org/assets: (no justification provided) (duration: 00m 43s)
[18:45:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:50] <logmsgbot>	 !log maxsem@tin Synchronized portals: (no justification provided) (duration: 00m 44s)
[18:45:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:21] <MaxSem>	 dunno, scap is broken?
[18:47:49] <MaxSem>	 jan_drewniak, I'm going to revert, unless there's another idea
[18:48:15] <RainbowSprinkles>	 MaxSem: sync-dir and sync-file are the same thing fwiw
[18:48:22] <RainbowSprinkles>	 sync-dir is a hidden back-compat alias now
[18:48:49] <MaxSem>	 did something break in process? :}
[18:49:07] <RainbowSprinkles>	 That code has been live for like a month now, so doubt it
[18:50:11] <MaxSem>	 dunno, reverting and filing bug
[18:51:49] <Amir1>	 Several days ago when I wanted to deploy the Wikidata extension, sync-dir didn't do anything when I tried to sync-dir the whole extension (there were lots of files touched), but when I did sync-dir and sync-file on smaller directories it worked
[18:51:50] <Amir1>	 HTH
[18:52:16] <jan_drewniak>	 MaxSem: Yeah, probably deserves a phab task, because this happened last time (touch fixed it then) 
[18:52:44] <MaxSem>	 grrr
[18:54:55] <logmsgbot>	 !log maxsem@tin Synchronized portals/: (no justification provided) (duration: 00m 48s)
[18:55:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:59:31] <MaxSem>	 !log Portals were not deployed: https://phabricator.wikimedia.org/T161832
[18:59:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:00:05] <jouncebot>	 RainbowSprinkles: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170330T1900). Please do the needful.
[19:01:09] <Amir1>	 so I think my patch in SWAT is not done :(
[19:01:33] <MaxSem>	 :(
[19:02:03] <Amir1>	 I will take it to the next window 
[19:05:31] <Amir1>	 This image acts strange: https://commons.wikimedia.org/wiki/File:Fawiki500k_celebration_by_Behdad_Abedi_(180).jpg
[19:05:59] <Amir1>	 it doesn't show anything and clicking on "original file" gives 404 
[19:06:00] <Amir1>	 https://upload.wikimedia.org/wikipedia/commons/e/e6/Fawiki500k_celebration_by_Behdad_Abedi_%28180%29.jpg
[19:06:15] <Amir1>	 I don't know where I should talk about it though
[19:25:05] <logmsgbot>	 !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to wmf.18
[19:25:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:25:33] <icinga-wm>	 PROBLEM - carbon-local-relay metric drops on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [100.0]
[19:26:33] <icinga-wm>	 RECOVERY - carbon-local-relay metric drops on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[19:30:13] <icinga-wm>	 PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 385 bytes in 0.010 second response time
[19:31:13] <icinga-wm>	 PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0]
[19:32:13] <icinga-wm>	 RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0]
[19:39:13] <icinga-wm>	 RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 0.007 second response time
[21:02:44] <wikibugs>	 (03PS1) 10Niharika29: Test LoginNotify on Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345726 (https://phabricator.wikimedia.org/T158878)
[21:05:18] <wikibugs_>	 (03CR) 10Paladox: "@demon / @RainbowSprinkles / @Chad / @Dzahn This may not be needed after all :)" [puppet] - 10https://gerrit.wikimedia.org/r/343736 (owner: 10Paladox)
[21:05:51] <RainbowSprinkles>	 paladox: Ugh please do not ping me 3 different ways like that. I get e-mail notifications.
[21:06:00] <paladox>	 oh sorry
[21:07:49] <logmsgbot>	 !log demon@tin Synchronized php-1.29.0-wmf.18/extensions/Echo/includes/model/Event.php: fix logging class reference (duration: 00m 47s)
[21:07:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:09:50] <wikibugs>	 (03PS1) 10Chad: group2 to wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345745
[21:16:04] <wikibugs_>	 06Operations, 10Traffic, 10Wikimedia-Logstash: Investigate 502 errors from nginx when backend returns 302 - https://phabricator.wikimedia.org/T161819#3145357 (10BBlack) Ok, I was wrong in my initial thinking.  Even though we configure `proxy_buffering off;`, `proxy_buffer_size` is still a factor.  Technicall...
[21:17:15] <wikibugs_>	 (03PS1) 10Jdlrobson: Reflect change in purpose of RelatedArticlesFooterBlacklistedSkins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345758 (https://phabricator.wikimedia.org/T160076)
[21:18:05] <wikibugs_>	 06Operations, 10Continuous-Integration-Config, 10Continuous-Integration-Infrastructure: Update npm to 3 or 4 - https://phabricator.wikimedia.org/T155488#3145376 (10Krinkle)
[21:20:44] <wikibugs_>	 (03PS1) 10BBlack: tlsproxy: double the response buffer size [puppet] - 10https://gerrit.wikimedia.org/r/345767 (https://phabricator.wikimedia.org/T161819)
[21:25:13] <wikibugs_>	 (03CR) 10Chad: [C: 032] group2 to wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345745 (owner: 10Chad)
[21:26:19] <wikibugs>	 (03Merged) 10jenkins-bot: group2 to wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345745 (owner: 10Chad)
[21:27:44] <wikibugs_>	 (03CR) 10jenkins-bot: group2 to wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345745 (owner: 10Chad)
[21:34:04] <logmsgbot>	 !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 -> wmf.18
[21:34:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:19] <wikibugs_>	 (03CR) 10Dzahn: [C: 031] Remove a number of obsolete conditionals from mediawiki classes [puppet] - 10https://gerrit.wikimedia.org/r/345502 (owner: 10Muehlenhoff)
[22:03:15] <wikibugs_>	 06Operations, 10media-storage: 404 error while accessing some djvu files - https://phabricator.wikimedia.org/T161864#3145437 (10Paladox)
[22:03:18] <wikibugs>	 06Operations, 10media-storage: 404 error while accessing some djvu files - https://phabricator.wikimedia.org/T161864#3145480 (10Wieralee) And another one: https://pl.wikisource.org/wiki/Strona:Wykolejony_(Gruszecki)_24.jpg
[22:04:17] <wikibugs>	 06Operations, 10media-storage: 404 error while accessing some djvu files - https://phabricator.wikimedia.org/T161864#3145482 (10Paladox) Another user reported this on irc  [20:05:31]  <Amir1> This image acts strange: https://commons.wikimedia.org/wiki/File:Fawiki500k_celebration_by_Behdad_Abedi_(180).jpg [20:0...
[22:05:19] <wikibugs>	 06Operations, 10media-storage: 404 error while accessing some images files e.g. djvu and jpg - https://phabricator.wikimedia.org/T161864#3145483 (10Paladox)
[22:05:33] <icinga-wm>	 PROBLEM - carbon-local-relay metric drops on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [100.0]
[22:06:33] <icinga-wm>	 RECOVERY - carbon-local-relay metric drops on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[22:19:44] <wikibugs_>	 (03PS2) 10Dzahn: admin: Add an alternative ssh-key for goransm [puppet] - 10https://gerrit.wikimedia.org/r/345587 (owner: 10Tobias Gritschacher)
[22:20:16] <wikibugs>	 (03PS3) 10Dzahn: admin: Add an alternative ssh-key for goransm [puppet] - 10https://gerrit.wikimedia.org/r/345587 (owner: 10Tobias Gritschacher)
[22:21:32] <wikibugs>	 06Operations, 10media-storage: 404 error while accessing some images files e.g. djvu and jpg - https://phabricator.wikimedia.org/T161864#3145501 (10Wieralee) Another file has disappeared: https://commons.wikimedia.org/wiki/File:PL_J%C3%B3zef_Ignacy_Kraszewski-Poezye_tom_2.djvu
[22:23:47] <wikibugs_>	 (03CR) 10Dzahn: [C: 032] admin: Add an alternative ssh-key for goransm [puppet] - 10https://gerrit.wikimedia.org/r/345587 (owner: 10Tobias Gritschacher)
[22:25:03] <icinga-wm>	 PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:27:20] <wikibugs>	 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to hive for joewalsh - https://phabricator.wikimedia.org/T161663#3145503 (10Dzahn) a:05Dzahn>03Fjalapeno Hi, @Fjalapeno  we'll need your approval for this access request for Joe Walsh please. Could you add it and assign the ticket...
[22:27:59] <wikibugs>	 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Production shell access (request for researchers for pmiazga) - https://phabricator.wikimedia.org/T161658#3145507 (10Dzahn) a:05Dzahn>03pmiazga
[22:28:03] <icinga-wm>	 RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[22:37:42] <wikibugs_>	 (03PS2) 10Dzahn: base: sysctl/check_puppetrun: remove precise remnants [puppet] - 10https://gerrit.wikimedia.org/r/345366
[22:38:21] <andre__>	 There are some recent reports about 404 images on Commons in https://phabricator.wikimedia.org/T161864 - is that known?
[22:42:27] <mutante>	 andre__: i think it's known that it happens "sometimes" 
[22:43:18] <andre__>	 mutante: Yeah I was wondering if it's maybe "more recently more popular" as there are several reports (and someone set UBN priority and I wonder if that's justified) 
[22:43:56] <wikibugs_>	 06Operations, 10media-storage: 404 error while accessing some images files e.g. djvu and jpg - https://phabricator.wikimedia.org/T161864#3145530 (10saper) Same as T161864 reported earlier by @Amire80
[22:46:15] <wikibugs_>	 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files e.g. djvu and jpg - https://phabricator.wikimedia.org/T161836#3145534 (10saper) p:05Triage>03Unbreak!
[22:46:46] <RainbowSprinkles>	 I'm not sure it's justified as UBN, but w/e
[22:47:22] * RainbowSprinkles sips a drink from his armchair
[22:47:37] <wikibugs>	 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files e.g. djvu and jpg - https://phabricator.wikimedia.org/T161836#3145541 (10saper)
[22:47:40] <wikibugs_>	 06Operations, 10media-storage: 404 error while accessing some images files e.g. djvu and jpg - https://phabricator.wikimedia.org/T161864#3145543 (10saper)
[22:47:59] <mutante>	 andre__: hmm, i tried to find something, i found just the tracking bug, and we had this recently https://phabricator.wikimedia.org/T161360 but that was different: all thumbs
[22:48:24] <andre__>	 mutante, Hmmm.... Okay thanks for checking :-/
[22:48:26] <mutante>	 https://phabricator.wikimedia.org/T43371
[22:48:32] <mutante>	 tracking thing.. afraid that's what i have right now
[22:52:05] <mutante>	 it's probably "High" but not quite "UBN"
[22:54:55] <ankry>	 missing files totally stopped work on plwikisource
[22:56:33] <RainbowSprinkles>	 Well, the bug makes it sound like some files are missing, not all
[22:56:47] <RainbowSprinkles>	 If *all* files are missing, that's UBN
[22:57:41] <ankry>	 when can we expect the files to be restored so we can continue work?
[22:59:42] <aude>	 hi
[23:00:01] <aude>	 i would like to add something to swat, but will take me a few minutes to prepare
[23:00:04] <jouncebot>	 addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170330T2300). Please do the needful.
[23:00:04] <jouncebot>	 RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:08] <aude>	 (can deploy myself, if easier)
[23:00:27] <RoanKattouw>	 I'm here
[23:00:48] <RoanKattouw>	 aude: OK go ahead and do that after I'm done?
[23:00:50] <RainbowSprinkles>	 ankry: Well, that all depends on what's broken with it. Nobody knows yet :)
[23:00:55] <RainbowSprinkles>	 So, could be 5 minutes, could be 5 years!
[23:00:57] <RainbowSprinkles>	 :D
[23:01:29] <ankry>	 sad to tell people they should stop work for 5 years
[23:01:44] <ankry>	 they will definitely leave the project
[23:01:56] <saper>	 anything in some logs?
[23:02:01] <RainbowSprinkles>	 ankry: My point is: it's impossible to give an ETA until someone debugs it.
[23:02:23] <RainbowSprinkles>	 Also: is it affecting all djvu files, or just some? The bug makes it sound like it's just *some*
[23:02:32] <aude>	 RoanKattouw: ok
[23:02:49] <aude>	 always takes some time to update the wikidata build and wait for jenkins
[23:02:53] <ankry>	 those that were worked on
[23:02:53] <icinga-wm>	 PROBLEM - puppet last run on kafka1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:02:54] <RainbowSprinkles>	 saper: Nothing standing out, but we've had some weird un-parseable djvu errors I've seen the last few weeks/months
[23:03:12] <ankry>	 RainbowSprinkles: those that needed thumbnail generation
[23:03:14] <RainbowSprinkles>	 Basically error message saying "</djvu>" and nothing else
[23:03:14] <aude>	 and cherry pick
[23:03:27] <ankry>	 also some jpg
[23:07:49] <logmsgbot>	 !log catrope@tin Synchronized php-1.29.0-wmf.18/extensions/ORES/modules/: T161706 (duration: 00m 51s)
[23:07:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:59] <stashbot>	 T161706: Review ORES prediction visibility on wikis where they are enabled by default - https://phabricator.wikimedia.org/T161706
[23:08:17] <RoanKattouw>	 OK, I'm done
[23:08:54] <aude>	 ok
[23:09:17] <aude>	 waiting for jenkins...
[23:12:36] <wikibugs>	 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files e.g. djvu and jpg - https://phabricator.wikimedia.org/T161836#3145574 (10Ankry) Files from  T161864 that disappeared:  https://commons.wikimedia.org/wiki/File:Andrzej_Kijowski_-_Listopadowy_wiecz%C3%B3r.djvu  https://com...
[23:30:53] <icinga-wm>	 RECOVERY - puppet last run on kafka1003 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[23:34:31] <aude>	 updating the wikidata build now
[23:38:41] <wikibugs_>	 (03PS2) 10Gilles: Follow-up fix for hasattr on early error (0.1.37) [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/345608
[23:39:23] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:39:38] <wikibugs_>	 (03PS3) 10Gilles: Upgrade to 0.1.38 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/345608
[23:40:13] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.003 second response time
[23:49:56] <aude>	 ready to deploy
[23:52:35] <aude>	 testing on mwdebug1002
[23:55:05] <aude>	 looks good
[23:55:55] <wikibugs_>	 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files e.g. djvu and jpg - https://phabricator.wikimedia.org/T161836#3144779 (10EBernhardson) Random debugging:  ``` hphpd> $f = wfFindFile(Title::newFromText('File:Fawiki500k_celebration_by_Behdad_Abedi_(180).jpg')) $f = wfFin...
[23:58:15] <logmsgbot>	 !log aude@tin Synchronized php-1.29.0-wmf.18/extensions/Wikidata: Fixes for special pages (duration: 02m 15s)
[23:58:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:58:59] <aude>	 done