[00:05:15] RECOVERY - puppet last run on radium is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [00:12:05] RECOVERY - puppet last run on rdb1002 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [00:14:05] PROBLEM - puppet last run on analytics1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:17:25] RECOVERY - puppet last run on ms-be3002 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [00:42:05] RECOVERY - puppet last run on analytics1033 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [01:30:15] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 83022.462849 Seconds [01:30:15] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 83336.743009 Seconds [01:30:25] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 83032.225885 Seconds [01:32:05] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 83134.498687 Seconds [01:32:15] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 83457.481693 Seconds [01:32:25] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 83463.405082 Seconds [01:35:05] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [01:39:05] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 83554.484763 Seconds [01:44:05] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [01:47:05] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 84034.577108 Seconds [01:50:25] PROBLEM - puppet last run on dbmonitor2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:53:16] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 49.749438 Seconds [01:53:16] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 49.769887 Seconds [01:53:25] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 4.448113 Seconds [01:53:25] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 55.061666 Seconds [01:54:05] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 46.48539 Seconds [01:54:15] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 54.699507 Seconds [02:11:05] PROBLEM - Disk space on labtestcontrol2001 is CRITICAL: DISK CRITICAL - free space: / 347 MB (3% inode=69%) [02:18:25] RECOVERY - puppet last run on dbmonitor2001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [02:24:18] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.18) (duration: 09m 39s) [02:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:50:16] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 2134.795114 Seconds [02:50:16] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 2134.800085 Seconds [02:51:16] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [02:51:25] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 2083.863499 Seconds [02:52:25] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [02:54:15] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 2374.8852 Seconds [02:56:35] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [02:57:05] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 2426.033831 Seconds [02:57:25] RECOVERY - 
Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [02:58:26] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 2619.697084 Seconds [02:59:05] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [02:59:35] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 2691.360625 Seconds [03:00:25] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 2623.749258 Seconds [03:00:25] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds [03:00:35] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [03:01:15] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 2674.096896 Seconds [03:01:25] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [03:02:25] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 2859.822896 Seconds [03:03:15] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [03:03:25] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 2919.656697 Seconds [03:03:25] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [03:03:35] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 2931.275564 Seconds [03:04:25] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 2863.770715 Seconds [03:04:25] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds [03:05:25] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [03:05:35] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [03:10:35] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 3351.445726 Seconds 
[03:11:25] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 3399.89602 Seconds [03:11:35] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds [03:12:25] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [03:58:21] (03CR) 10Yuvipanda: [C: 032] "This was built and deployed" [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/345909 (owner: 10Yuvipanda) [04:03:15] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:12:55] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=4963.20 Read Requests/Sec=5542.10 Write Requests/Sec=19.80 KBytes Read/Sec=22240.80 KBytes_Written/Sec=192.40 [04:13:25] PROBLEM - puppet last run on mc1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:17:55] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.50 Read Requests/Sec=3.90 Write Requests/Sec=53.70 KBytes Read/Sec=18.40 KBytes_Written/Sec=348.00 [04:29:15] PROBLEM - puppet last run on elastic1050 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [04:30:15] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [04:38:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [04:41:25] RECOVERY - puppet last run on mc1007 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [04:43:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 18 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [04:57:15] RECOVERY - puppet last run on elastic1050 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [05:27:15] PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:49:15] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix typo in discovery name [switchdc] - 10https://gerrit.wikimedia.org/r/345868 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [05:50:59] (03PS2) 10Giuseppe Lavagetto: swift: use discovery url for thumb server [puppet] - 10https://gerrit.wikimedia.org/r/345804 [05:53:55] <_joe_> !log powercycling mw2256, unresponsive to ping, blank console [05:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:15] RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [05:55:49] (03PS5) 10Tim Starling: Deploy ParserMigration extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344276 (https://phabricator.wikimedia.org/T141586) [05:55:55] RECOVERY - Host mw2256 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms [05:58:05] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:59:50] !log Resume pt-table-checksum on wikidata - T161294 [05:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:59:56] T161294: run pt-tablechecksum on s5 - https://phabricator.wikimedia.org/T161294 [06:12:23] !log Remove partitions from metawiki.pagelinks (s7) on codfw master (db2029) this will generate lag on codfw - T153300 [06:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:30] T153300: Remove partitions from metawiki.pagelinks in s7 - https://phabricator.wikimedia.org/T153300 [06:12:33] 06Operations: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013#3149754 (10Joe) [06:12:46] 06Operations, 15User-Joe: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013#3149766 (10Joe) p:05Triage>03Low [06:18:25] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2002 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active [06:18:33] (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Remove db1057 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345856 (https://phabricator.wikimedia.org/T160435) (owner: 10Marostegui) [06:19:46] (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Remove db1057 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345856 (https://phabricator.wikimedia.org/T160435) (owner: 10Marostegui) [06:19:56] (03CR) 10jenkins-bot: db-codfw,db-eqiad.php: Remove db1057 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345856 (https://phabricator.wikimedia.org/T160435) (owner: 10Marostegui) [06:20:24] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 13Patch-For-Review, and 6 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#3149770 (10Joe) Status as of now: - DNS based discovery is live and functioning for most things,... 
[06:21:25] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed [06:21:32] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Remove db1057 entry - T160435 (duration: 00m 54s) [06:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:39] T160435: db1057 does not react to powercycle/powerdown/powerup commands - https://phabricator.wikimedia.org/T160435 [06:23:25] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Prepare and improve the datacenter switchover procedure - https://phabricator.wikimedia.org/T154658#3149779 (10Joe) [06:25:01] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Remove db1057 entry - T160435 (duration: 00m 44s) [06:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:25:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [06:27:05] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [06:30:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 18 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [06:33:57] (03PS1) 10Marostegui: db-eqiad.php: Depool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346102 [06:36:15] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational [06:36:25] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2002 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active [06:36:56] RECOVERY - Etcd replication lag on conf2002 is OK: HTTP OK: HTTP/1.1 200 OK - 149 bytes in 0.074 second response time [06:38:22] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1070 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/346102 (owner: 10Marostegui) [06:39:01] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 13Patch-For-Review, and 6 others: Integrating MediaWiki (and other services) with dynamic configuration - https://phabricator.wikimedia.org/T149617#3149782 (10Joe) [06:39:04] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 13Patch-For-Review, and 6 others: DNS: dynamically generate entries for service discovery - https://phabricator.wikimedia.org/T156100#3149781 (10Joe) 05Open>03Resolved [06:39:05] RECOVERY - Disk space on labtestcontrol2001 is OK: DISK OK [06:39:35] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346102 (owner: 10Marostegui) [06:39:44] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346102 (owner: 10Marostegui) [06:40:54] <_joe_> !log manually restarted replication for etcd [06:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:21] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1070 to compress it - T153743 (duration: 00m 44s) [06:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:27] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [06:42:55] 06Operations, 07RfC, 06Services (watching), 15User-Joe, and 2 others: [RFC] Define the on-disk and live structure of etcd pool data - https://phabricator.wikimedia.org/T100793#3149786 (10Joe) 05Open>03declined [06:43:34] 06Operations, 07RfC, 06Services (watching), 15User-Joe, and 2 others: [RFC] Define the on-disk and live structure of etcd pool data - https://phabricator.wikimedia.org/T100793#1320581 (10Joe) This has been practically superseded by so many specific tickets it doesn't really make much sense anymore. 
[06:51:35] RECOVERY - puppet last run on copper is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:51:54] !log Deploy InnoDB compression on dewiki - db1070 - T150438 [06:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:01] T150438: Meta ticket: Deploy InnoDB compression where possible - https://phabricator.wikimedia.org/T150438 [07:00:55] PROBLEM - Check systemd state on copper is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [07:03:16] !log installing gnutls security updates on trusty [07:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:12] <_joe_> !log removing stale files on copper for docker, all local images will be wiped away [07:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:24] (03PS2) 10Muehlenhoff: Change email address for Yuvi [puppet] - 10https://gerrit.wikimedia.org/r/344133 [07:08:53] (03CR) 10Muehlenhoff: [C: 032] Change email address for Yuvi [puppet] - 10https://gerrit.wikimedia.org/r/344133 (owner: 10Muehlenhoff) [07:09:35] PROBLEM - puppet last run on copper is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 25 seconds ago with 1 failures. 
Failed resources (up to 3 shown): Service[docker] [07:09:43] (03PS2) 10Muehlenhoff: Install jessie systems with Linux 4.9 by default [puppet] - 10https://gerrit.wikimedia.org/r/345314 (https://phabricator.wikimedia.org/T154934) [07:14:43] !log switched default kernel for jessie installations to Linux 4.9 [07:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:55] RECOVERY - Check systemd state on copper is OK: OK - running: The system is fully operational [07:25:12] <_joe_> !log rebooting copper to clean up at least partially the docker mess [07:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:51] !log Deploy alter table dbstore2001 (s7) on revision table to unify PK and indexes - T160390 [07:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:25:57] T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390 [07:27:35] RECOVERY - puppet last run on copper is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [07:32:15] PROBLEM - Apache HTTP on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:32:15] PROBLEM - Nginx local proxy to apache on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:32:16] PROBLEM - HHVM rendering on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:33:25] PROBLEM - puppet last run on db1065 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:35:17] (03CR) 10Marostegui: [C: 031] "I would test this manually first for some days with manual tables, before letting it run by itself" [puppet] - 10https://gerrit.wikimedia.org/r/345646 (https://phabricator.wikimedia.org/T124307) (owner: 10Ottomata) [07:39:04] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1261.eqiad.wmnet [07:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:16] moritzm: --^ [07:39:51] elukey: please don't restart yet [07:40:05] sure sure, hhvm-dump-debug in /tmp/hhvm.60283.bt. [07:40:07] I repooled it to let it crash to collect more information [07:40:09] thanks :-) [07:40:14] ahhh sorry! [07:40:23] did you restart? [07:40:27] nono [07:40:34] ok, great [07:40:48] I just wanted to remove traffic since I didn't see you online [07:41:04] <_joe_> elukey: traffic is removed from pybal already [07:41:54] _joe_ sure but there is the case that the host might show intermittent failures, just wanted to be sure :) [07:44:13] (03PS1) 10Alexandros Kosiaris: certspotter: Silence the cronspam [puppet] - 10https://gerrit.wikimedia.org/r/346103 [07:53:32] (03Abandoned) 10Giuseppe Lavagetto: salt: add conftool module [puppet] - 10https://gerrit.wikimedia.org/r/295202 (owner: 10Giuseppe Lavagetto) [08:02:25] RECOVERY - puppet last run on db1065 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [08:15:04] (03PS1) 10Gehel: maps - collect OSM sync lag to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/346106 [08:23:05] PROBLEM - puppet last run on uranium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:25:49] (03PS1) 10Marostegui: db-eqiad.php: Depool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346108 (https://phabricator.wikimedia.org/T160390) [08:29:38] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346108 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui) [08:30:16] (03PS2) 10Gehel: maps - collect OSM sync lag to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/346106 (https://phabricator.wikimedia.org/T160011) [08:30:40] (03PS1) 10Volans: Puppet: do not deactivate hosts in PuppetDB automatically [puppet] - 10https://gerrit.wikimedia.org/r/346110 (https://phabricator.wikimedia.org/T159163) [08:34:15] RECOVERY - Apache HTTP on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.672 second response time [08:34:15] RECOVERY - Nginx local proxy to apache on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.673 second response time [08:34:15] RECOVERY - HHVM rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 77987 bytes in 0.925 second response time [08:34:36] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346108 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui) [08:34:48] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1086 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346108 (https://phabricator.wikimedia.org/T160390) (owner: 10Marostegui) [08:38:25] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1086 - T160390 (duration: 00m 44s) [08:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:31] T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390 [08:39:16] (03PS3) 10Alexandros Kosiaris: maps - collect OSM sync lag to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/346106 
(https://phabricator.wikimedia.org/T160011) (owner: 10Gehel) [08:39:25] (03CR) 10Alexandros Kosiaris: [C: 031] maps - collect OSM sync lag to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/346106 (https://phabricator.wikimedia.org/T160011) (owner: 10Gehel) [08:39:28] (03CR) 10Volans: "Puppet compiler results:" [puppet] - 10https://gerrit.wikimedia.org/r/346110 (https://phabricator.wikimedia.org/T159163) (owner: 10Volans) [08:40:37] 06Operations, 07Puppet, 13Patch-For-Review: PuppetDB is auto-deactivating hosts - https://phabricator.wikimedia.org/T159163#3150000 (10Volans) @Joe @akosiaris, actually looks like this is a NOOP on the puppetmasters, but a change on just the puppetdb hosts: ``` $ sudo cumin --dry-run 'R:class = puppetmaster... [08:41:14] (03CR) 10Gehel: [C: 032] maps - collect OSM sync lag to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/346106 (https://phabricator.wikimedia.org/T160011) (owner: 10Gehel) [08:42:26] (03CR) 10Alexandros Kosiaris: [C: 031] "LGTM but keep in mind this will cause IRC alert spam" [puppet] - 10https://gerrit.wikimedia.org/r/346110 (https://phabricator.wikimedia.org/T159163) (owner: 10Volans) [08:42:48] (03PS2) 10Alexandros Kosiaris: certspotter: Silence the cronspam [puppet] - 10https://gerrit.wikimedia.org/r/346103 [08:42:54] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] certspotter: Silence the cronspam [puppet] - 10https://gerrit.wikimedia.org/r/346103 (owner: 10Alexandros Kosiaris) [08:43:35] 06Operations, 07Puppet, 13Patch-For-Review: PuppetDB is auto-deactivating hosts - https://phabricator.wikimedia.org/T159163#3150002 (10Volans) Ops, I read the previous message as it required a restart of puppetmasters, not puppetdb, sorry for the misunderstanding. 
[08:45:47] akosiaris: there's T159137 open FYI [08:45:48] T159137: certspotter: Error retrieving STH from log - https://phabricator.wikimedia.org/T159137 [08:46:26] !log Deploy alter table db1086 (s7) on revision table to unify PK and indexes - T160390 [08:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:34] T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390 [08:49:08] (03PS1) 10Elukey: Fix Redis Hiera configuration for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/346112 [08:49:36] hashar: --^ [08:49:56] do you prefer that I only fix the inconsistency or that https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep will be copied over? [08:51:05] RECOVERY - puppet last run on uranium is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [08:53:01] (03CR) 10Giuseppe Lavagetto: [C: 031] "this was a brainfart on my side the other day." [puppet] - 10https://gerrit.wikimedia.org/r/346112 (owner: 10Elukey) [08:53:24] akosiaris: mmh, it looks like https://gerrit.wikimedia.org/r/346103 would also silence valid certspotter cron emails? [08:55:47] (03CR) 10Elukey: [C: 032] Fix Redis Hiera configuration for deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/346112 (owner: 10Elukey) [08:57:25] ema: hmm [08:57:40] maybe I should revert indeed [08:57:47] <_joe_> yeah that was my doubt as well [08:57:54] <_joe_> I wanted to suggest to use a log file [08:57:56] looks like the point of that cron IS to e-mail out [08:58:07] akosiaris: it is :) [08:58:10] yeah reverting, need to rethink [08:58:17] thanks for pointing out [08:58:46] sure [08:59:47] 06Operations, 06Analytics-Kanban, 06WMDE-Analytics-Engineering, 13Patch-For-Review, 15User-Addshore: /a/mw-log/archive/api on stat1002 no longer being populated - https://phabricator.wikimedia.org/T160888#3113734 (10Addshore) >>! In T160888#3140077, @elukey wrote: > @Addshore: I am going to close this ta... 
[09:01:15] (03PS1) 10Alexandros Kosiaris: Revert "certspotter: Silence the cronspam" [puppet] - 10https://gerrit.wikimedia.org/r/346116 [09:03:36] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "certspotter: Silence the cronspam" [puppet] - 10https://gerrit.wikimedia.org/r/346116 (owner: 10Alexandros Kosiaris) [09:03:41] (03PS2) 10Alexandros Kosiaris: Revert "certspotter: Silence the cronspam" [puppet] - 10https://gerrit.wikimedia.org/r/346116 [09:03:43] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Revert "certspotter: Silence the cronspam" [puppet] - 10https://gerrit.wikimedia.org/r/346116 (owner: 10Alexandros Kosiaris) [09:03:47] (03Abandoned) 10Giuseppe Lavagetto: swift: use discovery url for thumb server [puppet] - 10https://gerrit.wikimedia.org/r/345804 (owner: 10Giuseppe Lavagetto) [09:07:05] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 875.72 seconds [09:07:44] ^ checking [09:09:04] create table select….Queried about 650140000 [09:09:08] from research [09:22:26] <_joe_> marostegui: ahah [09:22:35] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:23:01] marostegui: my bet is on quarterly report ;) [09:27:16] (03PS1) 10Giuseppe Lavagetto: base::puppet: add disable-puppet script [puppet] - 10https://gerrit.wikimedia.org/r/346118 [09:28:33] (03CR) 10jerkins-bot: [V: 04-1] base::puppet: add disable-puppet script [puppet] - 10https://gerrit.wikimedia.org/r/346118 (owner: 10Giuseppe Lavagetto) [09:30:30] (03PS2) 10Giuseppe Lavagetto: base::puppet: add disable-puppet script [puppet] - 10https://gerrit.wikimedia.org/r/346118 [09:36:05] PROBLEM - puppet last run on restbase-dev1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:45:32] the problem is that CREATE ... 
SELECT or INSERT...SELECT should only be run on transactional tables [09:46:10] the size is not that important, that server is ok to do that [09:50:35] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [10:05:06] RECOVERY - puppet last run on restbase-dev1001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [10:09:05] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 200.00 seconds [10:16:56] (03CR) 10Giuseppe Lavagetto: [C: 031] Swift-proxy: use discovery everywhere for rewrites [puppet] - 10https://gerrit.wikimedia.org/r/345860 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [10:20:54] 06Operations, 07Documentation: Improve SSH access information in onboarding documentation - https://phabricator.wikimedia.org/T160941#3150151 (10MoritzMuehlenhoff) p:05Triage>03Normal a:03MoritzMuehlenhoff I'll take care of that. [10:27:44] (03PS1) 10Muehlenhoff: Remove access credentials for csteipp [puppet] - 10https://gerrit.wikimedia.org/r/346125 [10:28:05] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:29:06] 06Operations, 10Traffic: Select or Acquire Address Space for Asia Cache DC - https://phabricator.wikimedia.org/T156256#3150178 (10faidon) After a few back and forths and a lot of supporting documentation, we've passed the verification step of the process and we moved on to the next step as of today: > At this... 
[10:30:53] (03PS2) 10Volans: Swift-proxy: use discovery everywhere for rewrites [puppet] - 10https://gerrit.wikimedia.org/r/345860 (https://phabricator.wikimedia.org/T160178) [10:38:14] !log upgrading swift-proxy in eqiad to use discovery URLs [10:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:31] (03CR) 10Volans: [C: 032] Swift-proxy: use discovery everywhere for rewrites [puppet] - 10https://gerrit.wikimedia.org/r/345860 (https://phabricator.wikimedia.org/T160178) (owner: 10Volans) [10:41:35] (03PS2) 10Faidon Liambotis: ssh: update comments to remove precise mentions [puppet] - 10https://gerrit.wikimedia.org/r/345834 [10:41:38] (03PS2) 10Faidon Liambotis: puppet: remove fail() guard for precise [puppet] - 10https://gerrit.wikimedia.org/r/345835 [10:41:39] (03PS3) 10Faidon Liambotis: mediawiki: kill more precise references [puppet] - 10https://gerrit.wikimedia.org/r/345546 [10:41:42] (03PS3) 10Faidon Liambotis: hhvm: kill a precise reference [puppet] - 10https://gerrit.wikimedia.org/r/345547 [10:41:44] (03PS3) 10Faidon Liambotis: apache: remove precise (and Apache 2.2) support [puppet] - 10https://gerrit.wikimedia.org/r/345548 [10:41:46] (03PS1) 10Faidon Liambotis: Remove Apache across the tree [puppet] - 10https://gerrit.wikimedia.org/r/346128 [10:42:36] (03CR) 10Faidon Liambotis: [V: 032 C: 032] ssh: update comments to remove precise mentions [puppet] - 10https://gerrit.wikimedia.org/r/345834 (owner: 10Faidon Liambotis) [10:42:54] (03CR) 10Faidon Liambotis: [C: 032] puppet: remove fail() guard for precise [puppet] - 10https://gerrit.wikimedia.org/r/345835 (owner: 10Faidon Liambotis) [10:43:20] !log volans@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,name=ms-fe1005.eqiad.wmnet [10:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:25] 06Operations: videoscalers (mw1168, mw1169) - high load / overheating - https://phabricator.wikimedia.org/T161918#3147165 
(10MoritzMuehlenhoff) This ticket has (at least) two independent tasks: (1) Fine-tuning the video scaler queue and (2) applying thermal paste to the video scalers (which turned out to be effe... [10:43:52] 06Operations, 10ops-ulsfo: decommission backup4001 - https://phabricator.wikimedia.org/T161904#3150208 (10MoritzMuehlenhoff) p:05Triage>03Normal [10:45:26] 06Operations: Migrate all jessie hosts to Linux 4.9 - https://phabricator.wikimedia.org/T162029#3150214 (10MoritzMuehlenhoff) [10:50:32] (03CR) 10Muehlenhoff: [C: 031] releases: remove the precise suite [puppet] - 10https://gerrit.wikimedia.org/r/345838 (owner: 10Faidon Liambotis) [10:50:49] (03PS1) 10Joal: Update analytics-cluster refinery cron regularity [puppet] - 10https://gerrit.wikimedia.org/r/346129 [10:50:55] elukey: --^ [10:52:12] (03CR) 10jerkins-bot: [V: 04-1] Update analytics-cluster refinery cron regularity [puppet] - 10https://gerrit.wikimedia.org/r/346129 (owner: 10Joal) [10:53:38] jouncebot: refresh [10:53:44] I refreshed my knowledge about deployments. 
[10:53:46] jouncebot: next [10:53:46] In 2 hour(s) and 6 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170403T1300) [10:57:05] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [11:01:11] 06Operations, 07HHVM: HHVM 3.18 deadlocks after 4-6 hours (stuck in in HPHP::Treadmill::getAgeOldestRequest() ) - https://phabricator.wikimedia.org/T161684#3150238 (10MoritzMuehlenhoff) p:05Triage>03High a:03MoritzMuehlenhoff [11:01:44] 06Operations, 10hardware-requests: EQIAD: (4) hardware access request for ganeti - https://phabricator.wikimedia.org/T161702#3150240 (10MoritzMuehlenhoff) p:05Triage>03Normal [11:01:46] 06Operations, 10hardware-requests: COFW: (2) hardware access request for ganeti - https://phabricator.wikimedia.org/T161701#3150241 (10MoritzMuehlenhoff) p:05Triage>03Normal [11:01:53] 06Operations, 10hardware-requests: CODFW: (4) hardware access request for kubernetes - https://phabricator.wikimedia.org/T161700#3150242 (10MoritzMuehlenhoff) p:05Triage>03Normal [11:02:06] PROBLEM - puppet last run on mw1248 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:04:15] !log volans@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,name=ms-fe1005.eqiad.wmnet [11:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:50] !log volans@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,name=ms-fe1006.eqiad.wmnet [11:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:31] (03PS2) 10Joal: Update analytics-cluster refinery cron [puppet] - 10https://gerrit.wikimedia.org/r/346129 [11:06:40] elukey: patched --^ [11:07:40] (03CR) 10jerkins-bot: [V: 04-1] Update analytics-cluster refinery cron [puppet] - 10https://gerrit.wikimedia.org/r/346129 (owner: 10Joal) [11:08:56] !log volans@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,name=ms-fe1006.eqiad.wmnet [11:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:12] 06Operations, 06Performance-Team, 10Traffic: What happened 2017-03-09 04:00 - 06:00 UTC - https://phabricator.wikimedia.org/T160109#3150245 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff I think the immediate question as to what happened 2017-03-09 04:00 - 06:00 UTC has been resolved, so... [11:10:51] joal: jenkins is upset by your alignment of => :) [11:11:17] elukey: I actually didn't do that, I used existing patch [11:12:31] joal: nono I mean that the arrow in monthday => '1' is not aligned with the rest of the cron block [11:12:39] (the other arrows) [11:12:45] so puppet lint is not happy :) [11:13:01] elukey: Ahhhh ! 
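(Editor's note: the lint failure elukey describes above comes down to every `=>` in a resource body starting at the same column. A minimal Python sketch of that check follows. It is an illustration only, not how puppet-lint is implemented, and the cron attributes and command path in the sample are hypothetical rather than taken from the actual refinery manifest.)

```python
def arrows_aligned(block: str) -> bool:
    """Return True when every '=>' in the block starts at the same column."""
    columns = {
        line.index("=>")          # column of the first arrow on the line
        for line in block.splitlines()
        if "=>" in line           # ignore lines without an arrow
    }
    return len(columns) <= 1


# Hypothetical cron resource body (assumption, not the real manifest).
misaligned = """\
    command => '/usr/local/bin/example-job',
    hour    => '0',
    monthday => '1',
"""

# Same attributes, padded so that all arrows line up.
aligned = """\
    command  => '/usr/local/bin/example-job',
    hour     => '0',
    monthday => '1',
"""

print(arrows_aligned(misaligned))
print(arrows_aligned(aligned))
```

(Adding a longer attribute name such as `monthday` to an existing block shifts the required arrow column for every other attribute, which is why the whole block usually needs re-padding.)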
[11:13:04] elukey: patching [11:13:24] * joal is ignorant in puppet - and in linting even more [11:13:29] git st [11:13:32] oop [11:14:20] (03PS3) 10Joal: Update analytics-cluster refinery cron [puppet] - 10https://gerrit.wikimedia.org/r/346129 [11:16:17] (03CR) 10Elukey: [C: 032] Update analytics-cluster refinery cron [puppet] - 10https://gerrit.wikimedia.org/r/346129 (owner: 10Joal) [11:18:03] !log volans@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,name=ms-fe1007.eqiad.wmnet [11:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:35] Thanks elukey [11:21:42] !log volans@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,name=ms-fe1007.eqiad.wmnet [11:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:18] !log volans@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,name=ms-fe1008.eqiad.wmnet [11:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:26] !log joal@tin Started deploy [analytics/refinery@cc73c40]: (no justification provided) [11:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:46] (03Abandoned) 10Addshore: Add wmde ldap group to grafana [puppet] - 10https://gerrit.wikimedia.org/r/333024 (https://phabricator.wikimedia.org/T161484) (owner: 10Addshore) [11:31:05] RECOVERY - puppet last run on mw1248 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:31:45] !log volans@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,name=ms-fe1008.eqiad.wmnet [11:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:49] !log joal@tin Finished deploy [analytics/refinery@cc73c40]: (no justification provided) (duration: 07m 23s) [11:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:33] (03PS2) 10Muehlenhoff: Remove access credentials for 
csteipp [puppet] - 10https://gerrit.wikimedia.org/r/346125 [11:39:05] (03CR) 10Muehlenhoff: [C: 032] Remove access credentials for csteipp [puppet] - 10https://gerrit.wikimedia.org/r/346125 (owner: 10Muehlenhoff) [11:40:42] 06Operations: videoscalers (mw1168, mw1169) - high load / overheating - https://phabricator.wikimedia.org/T161918#3150330 (10elukey) Yes there is some work to do for 1), I'll take care of it in a separate code review. For this particular issue, namely the videoscalers alarming, I am not sure what fixed it, since... [11:45:32] !log volans@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,name=ms-fe1006.eqiad.wmnet [11:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:35] 06Operations, 10Pybal, 10Traffic: Unhandled pybal ValueError: need more than 1 value to unpack - https://phabricator.wikimedia.org/T143078#3150356 (10ema) 05Open>03Resolved a:03ema Confirmed, upgrading twisted to 16.2.0 fixed this. [11:49:48] !log volans@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,name=ms-fe1006.eqiad.wmnet [11:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:06] (03CR) 10Liuxinyu970226: [C: 04-1] "I'm sorry, but currently we should not handle khw here, see https://lists.wikimedia.org/pipermail/langcom/2017-April/001207.html" [puppet] - 10https://gerrit.wikimedia.org/r/343584 (https://phabricator.wikimedia.org/T160865) (owner: 10Dereckson) [11:51:38] (03CR) 10Liuxinyu970226: [C: 04-1] RESTBase: add kbp. 
and khw.wikipedia (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/343584 (https://phabricator.wikimedia.org/T160865) (owner: 10Dereckson) [11:52:51] 06Operations, 06Office-IT, 07LDAP: Remove disabled users from internal mailing lists - https://phabricator.wikimedia.org/T161004#3150359 (10MoritzMuehlenhoff) @mmodell : Disabled mail accounts should be a problem independant of disabled @wikimedia.org accounts, can you describe how Phabricator handles those?... [11:53:26] Hi [11:54:52] 06Operations, 10Wikimedia-Site-requests, 10media-storage, 15User-Urbanecm: Solve missing 200px size of File:Status_iucn3.1_LC_cs.svg - https://phabricator.wikimedia.org/T162035#3150362 (10Urbanecm) [11:55:13] 06Operations, 10Wikimedia-Site-requests, 10media-storage, 15User-Urbanecm: Solve missing 200px size of File:Status_iucn3.1_LC_cs.svg - https://phabricator.wikimedia.org/T162035#3150376 (10Urbanecm) p:05Triage>03Unbreak! Breaking change => UBN! [11:55:30] Hi all, may somebody have a look at T162035 ? [11:55:30] T162035: Solve missing 200px size of File:Status_iucn3.1_LC_cs.svg - https://phabricator.wikimedia.org/T162035 [11:56:09] (03Abandoned) 10Dereckson: RESTBase: add kbp. and khw.wikipedia [puppet] - 10https://gerrit.wikimedia.org/r/343584 (https://phabricator.wikimedia.org/T160865) (owner: 10Dereckson) [11:57:03] Urbanecm: sorry to ask but what should be the image displayed? [11:57:34] I can see something but not sure if it is the right one or not [11:57:46] elukey, I can't understand. You can have a look at https://cs.wikipedia.org/w/index.php?title=Wikipedie:Pod_l%C3%ADpou_(technika)&oldid=14873106#Dal.C5.A1.C3.AD_problematick.C3.BD_obr.C3.A1zek what it displays now. [11:58:17] elukey, ah, you want correct image. 
It should be https://upload.wikimedia.org/wikipedia/commons/thumb/5/57/Status_iucn3.1_LC_cs.svg/201px-Status_iucn3.1_LC_cs.svg.png (but of course 1 px smaller) [11:58:27] I've linked the image in the task BTW [11:59:18] 06Operations, 10Wikimedia-Site-requests, 10media-storage, 15User-Urbanecm: Solve missing 200px size of File:Status_iucn3.1_LC_cs.svg - https://phabricator.wikimedia.org/T162035#3150400 (10Urbanecm) [11:59:34] Urbanecm: yes thanks, I can see your link correctly then on my browser (I am going through the esams cache though) [11:59:56] what error do you see? [11:59:58] elukey, I live in EU. Do I use another cluster? I don't know shortcuts. [12:00:10] (03PS3) 10Hashar: Test for throttle rules: parameters logic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318301 (owner: 10Dereckson) [12:00:22] ERR_CONTENT_DECODING_FAILED [12:00:48] Last time I saw this, the server was sending a compressed-content header even though the content wasn't compressed (or vice versa, I can't remember exactly...) [12:00:58] elukey, ^^^ [12:01:42] BTW when I try to download it using wget, it downloads correctly... [12:02:10] Urbanecm: confirmed, I can also reproduce the problem [12:02:30] good :) [12:02:54] elukey, BTW one other user reported it at a wiki page at cswiki (and I converted the report to Phabricator). [12:02:58] Content Encoding Error The page you are trying to view cannot be shown because it uses an invalid or unsupported form of compression. [12:04:08] let's update the task's description with more info :) [12:06:09] elukey, should I update it? Or was it for somebody else? [12:06:35] (03CR) 10Hashar: [C: 032] Test for throttle rules: parameters logic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318301 (owner: 10Dereckson) [12:06:45] (03CR) 10Hashar: [C: 032] "Thanks for the follow up!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/318301 (owner: 10Dereckson) [12:07:32] 06Operations, 07Puppet, 06Discovery, 06Maps, 03Interactive-Sprint: Puppet fails with "Could not find init script for 'postgresql@9.4-main'" on maps / labs server - https://phabricator.wikimedia.org/T161893#3150410 (10MoritzMuehlenhoff) p:05Triage>03Normal [12:07:34] Urbanecm: if you could (including browsers that you tried etc..) it would be great [12:07:46] elukey, okay, working on it [12:07:58] (03Merged) 10jenkins-bot: Test for throttle rules: parameters logic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318301 (owner: 10Dereckson) [12:08:07] (03CR) 10jenkins-bot: Test for throttle rules: parameters logic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/318301 (owner: 10Dereckson) [12:08:43] ^^^ I have rebased on tin.eqiad.wmnet . That only a test filel [12:13:57] 06Operations, 10Wikimedia-Site-requests, 10media-storage, 15User-Urbanecm: Solve missing 200px size of File:Status_iucn3.1_LC_cs.svg - https://phabricator.wikimedia.org/T162035#3150426 (10Urbanecm) [12:14:05] ema, elukey, updated ^^^^ [12:14:14] thanks :) [12:14:16] yw [12:14:46] Isn't that bug a dupe? [12:15:12] https://phabricator.wikimedia.org/T161836 [12:16:26] Reedy: I think those are two separate issues [12:16:46] 06Operations, 10Wikimedia-Site-requests, 10media-storage, 15User-Urbanecm: Solve missing 200px size of File:Status_iucn3.1_LC_cs.svg - https://phabricator.wikimedia.org/T162035#3150431 (10Urbanecm) [12:22:42] 06Operations, 10Wikimedia-Site-requests, 10media-storage, 15User-Urbanecm: Solve missing 200px size of File:Status_iucn3.1_LC_cs.svg - https://phabricator.wikimedia.org/T162035#3150438 (10Aklapper) p:05Unbreak!>03High The file is not missing, it just has a wrong type and cannot be rendered ([text/html]... 
[12:23:16] 06Operations, 10Wikimedia-Site-requests, 10media-storage, 15User-Urbanecm: Specific PNG thumbnail delivered as [text/html] instead of [image/png] and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3150440 (10Aklapper) [12:27:20] Reedy, I think so too. In the bug you linked you receive 404 but there an image with MIME type text/html (with no visible reason) [12:28:51] !log banning 200px-Status_iucn3.1_LC_cs.svg.png from esams frontends T162035 [12:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:57] T162035: Specific PNG thumbnail delivered as [text/html] instead of [image/png] and hence not rendered in browser - https://phabricator.wikimedia.org/T162035 [12:29:00] (03PS1) 10Bmansurov: enwiki: Temporarily disable Wikidata descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346136 (https://phabricator.wikimedia.org/T161805) [12:29:16] PROBLEM - puppet last run on mw1195 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:29:49] Urbanecm: can you re check if you still see the issue? [12:31:05] PROBLEM - Hadoop NodeManager on analytics1028 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [12:31:21] this is me [12:31:30] elukey, yes, confirmed. [12:31:39] thanks! [12:31:45] elukey, but when I refreshed, the problem has solved. [12:31:50] Maybe caching of errors? [12:32:26] ema just banned the item from the esam cache (the one that you are hitting) [12:32:54] Thank you ema ! 
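(Editor's note: the retitled task describes a PNG thumbnail served as [text/html] instead of [image/png]. A rough Python sketch of how such a mismatch could be flagged is shown below, assuming the expected type can be guessed from the URL's file extension. The helper name `content_type_mismatch` is made up for illustration, and the header values are examples rather than captured responses.)

```python
import mimetypes
from urllib.parse import urlparse


def content_type_mismatch(url: str, content_type_header: str) -> bool:
    """Return True when the served Content-Type disagrees with the URL's extension."""
    # Guess the expected MIME type from the path only (last extension wins).
    expected, _ = mimetypes.guess_type(urlparse(url).path)
    # Drop parameters such as "; charset=UTF-8" before comparing.
    served = content_type_header.split(";", 1)[0].strip().lower()
    return expected is not None and served != expected


# URL quoted earlier in the channel; the header values are illustrative.
thumb_url = (
    "https://upload.wikimedia.org/wikipedia/commons/thumb/5/57/"
    "Status_iucn3.1_LC_cs.svg/201px-Status_iucn3.1_LC_cs.svg.png"
)

print(content_type_mismatch(thumb_url, "text/html; charset=UTF-8"))
print(content_type_mismatch(thumb_url, "image/png"))
```

(A check like this only detects the symptom. It says nothing about why the cache stored the wrong header.)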
[12:33:00] 06Operations, 10Traffic, 10Wikimedia-Site-requests, 10media-storage, 15User-Urbanecm: Specific PNG thumbnail delivered as [text/html] instead of [image/png] and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3150458 (10ema) [12:33:16] 06Operations, 10Traffic, 10Wikimedia-Site-requests, 10media-storage, 15User-Urbanecm: Specific PNG thumbnail delivered as [text/html] instead of [image/png] and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3150362 (10Nemo_bis) Same as T162033? [12:33:46] 06Operations, 10Traffic, 10media-storage, 15User-Urbanecm: Specific PNG thumbnail delivered as [text/html] instead of [image/png] and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3150463 (10Nemo_bis) [12:33:48] 06Operations, 10Traffic, 10media-storage, 15User-Urbanecm: Specific PNG thumbnail delivered as [text/html] instead of [image/png] and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3150362 (10stjn) https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/Flag_of_Cross_of_Burgund... [12:34:05] RECOVERY - Hadoop NodeManager on analytics1028 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [12:34:07] 06Operations, 10Traffic, 10media-storage, 15User-Urbanecm: Specific PNG thumbnail delivered as [text/html] instead of [image/png] and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3150467 (10Urbanecm) It seems like it. [12:37:22] 06Operations, 10Traffic: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#3150472 (10MoritzMuehlenhoff) Text looks good to me, but two points: - "(upgrade or use Firefox!)" is somethat confusing since people might think an updated IE would be av... 
[12:37:44] !log reimage analytics10[29,30,31] to Debian Jessie [12:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:21] 06Operations, 10Traffic, 10media-storage, 15User-Urbanecm: Specific PNG thumbnail delivered as [text/html] instead of [image/png] and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3150475 (10Urbanecm) [12:51:57] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3150500 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['analytics1029.eqiad.wmnet', 'analytics1030.... [12:54:53] Urbanecm: thank you! [12:55:07] ema, you're welcome! [12:56:46] ema: volans can we proceed with the mediawiki SWAT? [12:57:00] or should we hold due to the PNG/thumb madness? [12:57:35] RECOVERY - puppet last run on mw1195 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [12:57:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [12:58:28] hashar: are you doing the swat? [12:58:42] i can [12:58:53] I'm here :) [12:59:22] hashar: not sure, still under investigation, I would probably hold a bit if not urgent but ask ema ;) [12:59:34] hashar: I can too, if the patches look good to you [12:59:43] lets review the patches [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170403T1300). [13:00:04] Urbanecm and bmansurov: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[13:00:11] here [13:00:12] pretty sure none of them are related to thumb nailing ema :} [13:00:30] (03PS2) 10Hashar: enwiki: Temporarily disable Wikidata descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346136 (https://phabricator.wikimedia.org/T161805) (owner: 10Bmansurov) [13:00:51] hashar: yeah go ahead [13:00:54] zeljkof: can you deploy https://gerrit.wikimedia.org/r/#/c/346136/ for bmansurov please ? [13:00:58] I am reviewing the other patches [13:01:05] hashar: sure [13:01:13] (03CR) 10Hashar: [C: 031] enwiki: Temporarily disable Wikidata descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346136 (https://phabricator.wikimedia.org/T161805) (owner: 10Bmansurov) [13:01:25] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:02:15] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient [13:02:32] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346136 (https://phabricator.wikimedia.org/T161805) (owner: 10Bmansurov) [13:02:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [13:03:34] (03Merged) 10jenkins-bot: enwiki: Temporarily disable Wikidata descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346136 (https://phabricator.wikimedia.org/T161805) (owner: 10Bmansurov) [13:03:47] (03CR) 10jenkins-bot: enwiki: Temporarily disable Wikidata descriptions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346136 (https://phabricator.wikimedia.org/T161805) (owner: 10Bmansurov) [13:03:58] (03PS2) 10Hashar: Add NS100 (Portal) to ladwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345647 (https://phabricator.wikimedia.org/T161843) (owner: 10Urbanecm) [13:04:00] (03PS2) 10Hashar: Add rollback user group in fawikisource [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/345964 (https://phabricator.wikimedia.org/T161946) (owner: 10Urbanecm) [13:04:02] (03PS2) 10Hashar: Optimalize all not-optimalized logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346057 (https://phabricator.wikimedia.org/T161999) (owner: 10Urbanecm) [13:04:04] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346144 [13:04:07] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346144 [13:04:13] all the rest looks fine and I rebased them [13:05:08] bmansurov: can 346136 be tested at mwdebug1002? [13:05:22] zeljkof, let me see [13:05:32] bmansurov: will be there in a minute [13:05:54] ok [13:06:43] bmansurov: the patch is at mwdebug1002, please test [13:06:51] zeljkof, testing [13:07:59] Hello [13:08:16] I've a cherry pick for wmf28 to add. [13:08:25] zeljkof, working! thanks for deploying. [13:08:41] bmansurov: ok, deploying to cluster then [13:08:52] ok [13:09:35] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:346136|enwiki: Temporarily disable Wikidata descriptions (T161805)]] (duration: 00m 45s) [13:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:44] T161805: Turn tagline wikidata descriptions off in enwiki - https://phabricator.wikimedia.org/T161805 [13:09:49] Dereckson: can you add it to https://wikitech.wikimedia.org/wiki/Deployments#Monday.2C.C2.A0April.C2.A003 please ?:) [13:09:58] bmansurov: deployed, please check on production [13:10:03] Dereckson: i will CR+2 it [13:10:41] zeljkof, tested, working. [13:10:44] Urbanecm: please always use [config] for anything in operations/mediawiki-config, as the current use of this task is to indicate the place we want to go to deploy it actually [13:10:47] zeljkof, thanks again. [13:11:01] bmansurov: great, thanks for deploying with #releng ;) [13:11:19] Dereckson, okay. 
[13:11:21] ;) [13:11:26] hashar: added, thanks [13:11:26] Urbanecm: around for swat? [13:11:34] zeljkof, yeah [13:11:57] hashar: can I continue with Urbanecm's patches? did you review them? [13:12:33] Urbanecm: so for example config = /srv/ [13:12:44] * Dereckson pressed enter too soon [13:12:49] Urbanecm: so for example config = /srv/ [13:13:02] copy paste issues day [13:13:20] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch: Use SSL certificates with discovery entry for elasticsearch - https://phabricator.wikimedia.org/T162037#3150511 (10Gehel) [13:13:55] Dereckson, I see two same posts. [13:14:01] (03PS1) 10Gehel: elasticsearch - create LVS service for relforge [dns] - 10https://gerrit.wikimedia.org/r/346146 (https://phabricator.wikimedia.org/T162037) [13:14:19] But okay, I'll use [config] for every patch in operations/mediawiki-config [13:14:34] Urbanecm: yes, so what I wanted to say is the tag allows to determine the working directory: for example config = /srv/mediawiki-staging, the root deploy directory, wmf18 = /srv/mediawiki-staging/php-1.29.0-wmf.18, wmf19 = /srv/mediawiki-staging/php-1.29.0-wmf.19 etc. [13:14:52] Now I understand. [13:14:55] Thank you [13:15:23] !log upgrading restbase-dev* to Linux 4.9 [13:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:14] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 13Patch-For-Review: Use SSL certificates with discovery entry for elasticsearch - https://phabricator.wikimedia.org/T162037#3150531 (10Gehel) I think that the transfer_to_es job is using a specific node instead of the service to simplify fir... 
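(Editor's note: Dereckson's explanation above maps a patch tag to the directory it is deployed from. A small Python sketch of that mapping follows, using only the three paths he lists. The function name, the `release` default, and the error handling are assumptions added for illustration.)

```python
import re

# Root deploy directory, as stated by Dereckson in the channel.
STAGING_ROOT = "/srv/mediawiki-staging"


def working_directory(tag: str, release: str = "1.29.0") -> str:
    """Map a SWAT patch tag such as 'config' or 'wmf18' to its working directory."""
    if tag == "config":
        return STAGING_ROOT
    match = re.fullmatch(r"wmf(\d+)", tag)
    if match is None:
        # Any other tag format is outside what the channel discussion covers.
        raise ValueError(f"unrecognised tag: {tag!r}")
    return f"{STAGING_ROOT}/php-{release}-wmf.{match.group(1)}"


print(working_directory("config"))
print(working_directory("wmf18"))
print(working_directory("wmf19"))
```

(The release number is fixed to 1.29.0 here only because both examples in the channel use it.)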
[13:18:41] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345647 (https://phabricator.wikimedia.org/T161843) (owner: 10Urbanecm) [13:18:44] 06Operations, 10Traffic, 10media-storage, 15User-Urbanecm: Specific PNG thumbnail delivered as [text/html] instead of [image/png] and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3150533 (10Urbanecm) https://upload.wikimedia.org/wikipedia/commons/thumb/1/19/Ambox_currentevent.s... [13:18:46] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345964 (https://phabricator.wikimedia.org/T161946) (owner: 10Urbanecm) [13:19:07] lets do Urbanecm patches [13:19:14] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346057 (https://phabricator.wikimedia.org/T161999) (owner: 10Urbanecm) [13:19:25] zeljkof, Dereckson, hashar: May somebody ban https://upload.wikimedia.org/wikipedia/commons/thumb/1/19/Ambox_currentevent.svg/48px-Ambox_currentevent.svg.png from esams cache? T162035 This is template-icon template which is frequently used. 
[13:19:26] T162035: Specific PNG thumbnail delivered as [text/html] instead of [image/png] and hence not rendered in browser - https://phabricator.wikimedia.org/T162035 [13:19:52] Urbanecm: the thumbs have an issue right now [13:19:57] (03Merged) 10jenkins-bot: Add NS100 (Portal) to ladwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345647 (https://phabricator.wikimedia.org/T161843) (owner: 10Urbanecm) [13:20:06] (03Merged) 10jenkins-bot: Add rollback user group in fawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345964 (https://phabricator.wikimedia.org/T161946) (owner: 10Urbanecm) [13:20:09] 06Operations, 10Traffic, 10media-storage, 15User-Urbanecm: Specific PNG thumbnail delivered as [text/html] instead of [image/png] and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3150536 (10ema) Note that the issue is pretty widespread, I'm seeing lots of requests affected by t... [13:20:12] (03CR) 10jenkins-bot: Add NS100 (Portal) to ladwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345647 (https://phabricator.wikimedia.org/T161843) (owner: 10Urbanecm) [13:20:21] hashar: should I continue with swat? [13:20:29] hashar, when I can expect fixing? And by what is the problem caused? [13:20:32] (03Merged) 10jenkins-bot: Optimalize all not-optimalized logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346057 (https://phabricator.wikimedia.org/T161999) (owner: 10Urbanecm) [13:20:42] Urbanecm: I have no idea I am not looking into it [13:20:52] zeljkof: I CR+2 all three patches, pushing them to mwdebug1001 [13:20:54] hashar, ok. [13:21:11] hashar: ok, so you are taking over swat then? 
[13:21:31] not really [13:21:51] zeljkof: guess you can baby sit https://gerrit.wikimedia.org/r/#/c/346058/ for Dereckson :} [13:22:08] (03CR) 10jenkins-bot: Add rollback user group in fawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345964 (https://phabricator.wikimedia.org/T161946) (owner: 10Urbanecm) [13:22:53] hashar: sure, should I deploy it now, or will you let me know when you are done' [13:22:55] ? [13:23:02] deploy it [13:23:05] that can be done in parallel [13:23:15] PROBLEM - puppet last run on db1081 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:23:17] yes, looks like completely unrelated file [13:23:22] Urbanecm: so your 3 patches are on mwdebug1001 [13:23:34] Urbanecm: will do the namespace check for ladwiki [13:23:35] hashar, okay, I'll test them [13:23:42] hashar, ok [13:23:54] hashar: ok, merging and deploying [13:24:09] Dereckson: can 346058 be tested at mwdebug1002? [13:25:49] (once it is there) [13:25:51] yes I think [13:26:03] ok, will ping you in a few minutes [13:26:11] if not, that would mean a full scap is required like for other l10n changes [13:26:24] hashar, working [13:27:12] !log terbium: scap pull for ladwiki namespace additions [13:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:42] Urbanecm: syncing. 
Though I am holding the static/images/project-logos/*.png thing for now [13:29:51] hashar, okay [13:29:54] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: Add NS100 (Portal) to ladwiki, Add rollback user group in fawikisource (duration: 00m 47s) [13:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:04] Dereckson: the patch is at mwdebug1002 [13:30:08] please test [13:33:29] hi MatmaRex [13:33:30] zeljkof: testing [13:33:42] (03PS1) 10Gehel: relforge - add LVS entry [puppet] - 10https://gerrit.wikimedia.org/r/346148 (https://phabricator.wikimedia.org/T162037) [13:33:57] 06Operations, 07HHVM: HHVM 3.18 deadlocks after 4-6 hours (stuck in in HPHP::Treadmill::getAgeOldestRequest() ) - https://phabricator.wikimedia.org/T161684#3150576 (10MoritzMuehlenhoff) After further debugging I now believe that stat_cache is still broken in 3.18 and causing these deadlocks. To confirm, I've d... [13:34:00] hi [13:34:04] did i break something? [13:34:50] <_joe_> MatmaRex: how exactly? [13:35:39] i got pinged on this channel, so i assume i must've broken something :D [13:36:05] (03PS1) 10Volans: Revert "Swift-proxy: use discovery everywhere for rewrites" [puppet] - 10https://gerrit.wikimedia.org/r/346149 [13:36:08] (just joking) [13:36:22] (03PS2) 10Volans: Revert "Swift-proxy: use discovery everywhere for rewrites" [puppet] - 10https://gerrit.wikimedia.org/r/346149 [13:37:36] 06Operations, 06Analytics-Kanban, 06WMDE-Analytics-Engineering, 13Patch-For-Review, 15User-Addshore: /a/mw-log/archive/api on stat1002 no longer being populated - https://phabricator.wikimedia.org/T160888#3150579 (10Ottomata) > we might want to open another one to track down how to move away api-logs fro... 
[13:37:36] zeljkof: working [13:37:47] Dereckson: ok, deploying to cluster then [13:38:40] !log zfilipin@tin Synchronized php-1.29.0-wmf.18/extensions/cldr/: SWAT: [[gerrit:346058|Translate Atikamekw language name in French]] (duration: 00m 51s) [13:38:43] thanks [13:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:52] Dereckson: deployed, please test on production [13:39:09] hashar: is everything deployed now? can we close the eu swat? [13:39:18] 06Operations, 05Goal, 07kubernetes: Prepare to service applications from kubernetes - https://phabricator.wikimedia.org/T162039#3150583 (10akosiaris) [13:39:18] works also in prod [13:39:22] Urbanecm: will have to sync the static/images/project-logos later on. There is a thumb issue on going [13:39:23] so no need for full scap, good news [13:39:29] Urbanecm: I am watching it though [13:40:02] hashar, ack [13:40:08] (03PS6) 10Matthias Mullie: Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: 10MarkTraceur) [13:41:08] hashar: ping me if there is anything else to deploy, I'm around but doing other stuff, looks like all patches for today are deployed [13:41:31] 06Operations, 05Goal, 07kubernetes: Eliminate SPOFs in the existing eqiad infrastructure - https://phabricator.wikimedia.org/T162040#3150598 (10akosiaris) [13:42:19] zeljkof: yeah all done but one [13:42:35] which I will take care of once the thumb/png issue is resolved [13:44:37] 06Operations, 05Goal, 07kubernetes: Expand the infrastructure to codfw - https://phabricator.wikimedia.org/T162041#3150620 (10akosiaris) [13:47:45] 06Operations: Migrate all jessie hosts to Linux 4.9 - https://phabricator.wikimedia.org/T162029#3150639 (10MoritzMuehlenhoff) p:05Triage>03Normal [13:49:42] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - 
https://atlas.ripe.net/measurements/1791309/#!map [13:50:12] 06Operations, 13Patch-For-Review: Package the next LTS kernel (4.9) - https://phabricator.wikimedia.org/T154934#3150651 (10MoritzMuehlenhoff) 05Open>03Resolved The new kernel is available on apt.wikimedia.org and is used by default on jessie installations. Closing, the migration of existing jessie installa... [13:50:33] 06Operations, 06Release-Engineering-Team, 06Services, 05Goal, 07kubernetes: Prepare and maintain base container images - https://phabricator.wikimedia.org/T162042#3150656 (10akosiaris) [13:51:22] RECOVERY - puppet last run on db1081 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [13:51:50] 06Operations, 05Goal, 07kubernetes: Define a process to keep images up-to-date on similar standards as the rest of production - https://phabricator.wikimedia.org/T162043#3150672 (10akosiaris) [13:51:54] cmjohnson1: hi! do you have a minute? [13:52:10] Hi elukey sure [13:52:53] thanks! I tried to reimage analytics1030.eqiad.wmnet and for some reason it is now stuck while booting, and powercycle/hardreset does not work [13:53:22] okay, you tried via mgmt? [13:54:40] 06Operations, 05Goal, 07kubernetes: Design and implement a Kubernetes-based staging environment. 
(stretch) - https://phabricator.wikimedia.org/T162045#3150713 (10akosiaris) [13:54:42] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [13:56:48] 06Operations, 10hardware-requests: EQIAD: (4) hardware access request for ganeti - https://phabricator.wikimedia.org/T161702#3150731 (10akosiaris) [13:56:51] 06Operations, 05Goal, 07kubernetes: Eliminate SPOFs in the existing eqiad infrastructure - https://phabricator.wikimedia.org/T162040#3150730 (10akosiaris) [13:57:27] 06Operations, 10hardware-requests: CODFW: (4) hardware access request for kubernetes - https://phabricator.wikimedia.org/T161700#3150733 (10akosiaris) [13:57:30] 06Operations, 05Goal, 07kubernetes: Expand the infrastructure to codfw - https://phabricator.wikimedia.org/T162041#3150732 (10akosiaris) [13:58:40] zeljkof: so swat not done yet then? [13:59:13] marostegui: I think hashar is waiting for something to be resolved before pushing one last thing [13:59:23] but swat time is up anyway in a minute [14:01:39] 06Operations, 10netops: asw-a1-codfw spontaneous reboot - https://phabricator.wikimedia.org/T159464#3150744 (10faidon) 05Open>03Resolved a:03faidon Logs didn't show anything and it hasn't happened in a month. Let's resolve for now. [14:01:58] zeljkof: sure, I can wait until you guys are fine with it :) [14:02:38] marostegui: I think our time is up, if you have deployment scheduled, wait a few minutes if hashar replies, if not, go ahead [14:02:53] marostegui: yeah it is done [14:03:07] well not completely, still have to sync some project logos but that is not an issue [14:03:12] Ah, great! 
Thanks guys :) [14:03:20] Seriously, I can wait, if you guys prefer me to wait [14:03:40] jouncebot: now [14:03:40] No deployments scheduled for the next 2 hour(s) and 56 minute(s) [14:07:15] (03PS3) 10Marostegui: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346144 [14:08:53] 06Operations, 10ops-eqiad, 06Analytics-Kanban, 06DC-Ops: analytics1030 stuck in console while booting - https://phabricator.wikimedia.org/T162046#3150761 (10elukey) [14:09:13] marostegui: you can deploy just fine :-} [14:09:44] the only thing I am holding is https://gerrit.wikimedia.org/r/#/c/346057/ which updates a few images in static/images/project-logos/ [14:10:00] 06Operations, 06Analytics-Kanban, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3150775 (10elukey) Analytics1030 is refusing to boot, opened a phab task: https://phabricator.wikimedia.org/T162046 [14:12:30] hashar: ok! thank you :) [14:13:10] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346144 (owner: 10Marostegui) [14:14:43] (03PS1) 10Ema: cache_upload: unset Content-Type on 304 responses [puppet] - 10https://gerrit.wikimedia.org/r/346157 (https://phabricator.wikimedia.org/T162035) [14:15:41] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346144 (owner: 10Marostegui) [14:17:05] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1086 - T160390 (duration: 00m 51s) [14:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:11] T160390: Unify revision table on s7 - https://phabricator.wikimedia.org/T160390 [14:17:50] (03CR) 10BBlack: [C: 031] cache_upload: unset Content-Type on 304 responses [puppet] - 10https://gerrit.wikimedia.org/r/346157 (https://phabricator.wikimedia.org/T162035) (owner: 10Ema) [14:17:52] (03PS3) 10Giuseppe 
Lavagetto: base::puppet: add puppet helper scripts [puppet] - 10https://gerrit.wikimedia.org/r/346118 [14:18:51] (03PS2) 10Alexandros Kosiaris: elasticsearch: Fix ERB instance variable notation [puppet] - 10https://gerrit.wikimedia.org/r/345845 [14:18:56] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] elasticsearch: Fix ERB instance variable notation [puppet] - 10https://gerrit.wikimedia.org/r/345845 (owner: 10Alexandros Kosiaris) [14:20:01] (03PS2) 10Ema: cache_upload: unset Content-Type on 304 responses [puppet] - 10https://gerrit.wikimedia.org/r/346157 (https://phabricator.wikimedia.org/T162035) [14:20:02] PROBLEM - puppet last run on dbmonitor2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:20:14] (03CR) 10Ema: [V: 032 C: 032] cache_upload: unset Content-Type on 304 responses [puppet] - 10https://gerrit.wikimedia.org/r/346157 (https://phabricator.wikimedia.org/T162035) (owner: 10Ema) [14:20:29] (03PS1) 10ArielGlenn: fix bug introduced in local variable cleanup for recombine jobs [dumps] - 10https://gerrit.wikimedia.org/r/346158 [14:22:41] 06Operations, 06Release-Engineering-Team, 05Goal, 06Services (designing), and 2 others: Prepare and maintain base container images - https://phabricator.wikimedia.org/T162042#3150814 (10mobrovac) [14:23:44] (03PS1) 10Marostegui: db-eqiad.php: Depool db1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346160 (https://phabricator.wikimedia.org/T159319) [14:25:18] (03CR) 10ArielGlenn: [C: 032] fix bug introduced in local variable cleanup for recombine jobs [dumps] - 10https://gerrit.wikimedia.org/r/346158 (owner: 10ArielGlenn) [14:26:24] 06Operations, 10Traffic: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#3150818 (10BBlack) Depending on the context I've been flipping between whether we're talking about just 3DES or both of the non-FS ciphers, sorry. In current weekly stat... 
[14:26:28] !log ariel@tin Started deploy [dumps/dumps@905a845]: fix stub recombines, broken by too agressive 'cleanup' of local vars [14:26:30] !log ariel@tin Finished deploy [dumps/dumps@905a845]: fix stub recombines, broken by too agressive 'cleanup' of local vars (duration: 00m 02s) [14:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:42] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [14:26:58] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346160 (https://phabricator.wikimedia.org/T159319) (owner: 10Marostegui) [14:28:21] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346160 (https://phabricator.wikimedia.org/T159319) (owner: 10Marostegui) [14:29:53] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1015 - T159319 (duration: 00m 44s) [14:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:42] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [14:33:20] (03PS1) 10Hoo man: Don't set removed Wikibase client settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346161 [14:33:28] (03PS1) 10Muehlenhoff: Set wireshark-common in debconf to avoid setuid prompt [puppet] - 10https://gerrit.wikimedia.org/r/346162 [14:33:40] (03PS2) 10Muehlenhoff: Set wireshark-common in debconf to avoid setuid prompt [puppet] - 10https://gerrit.wikimedia.org/r/346162 [14:38:33] 06Operations, 10ops-codfw: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3150855 (10Papaul) a:05Papaul>03RobH Flash Drive in place [14:45:33] 
06Operations, 10ops-eqiad: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158#3150869 (10faidon) a:05Cmjohnson>03Dzahn Someone, unfortunately, needs to follow the process outlined here: https://wikitech.wikimedia.org/wiki/OCG#Decommissioning_a_host @Dzahn, can I ask you to have a look at... [14:48:02] RECOVERY - puppet last run on dbmonitor2001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [14:49:32] !log cache_upload: ban all objects with content-type: text/html T162035 [14:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:39] T162035: Specific PNG thumbnail delivered as [text/html] instead of [image/png] and hence not rendered in browser - https://phabricator.wikimedia.org/T162035 [14:50:02] 06Operations, 07Puppet, 06Discovery, 06Maps, 03Interactive-Sprint: Puppet fails with "Could not find init script for 'postgresql@9.4-main'" on maps / labs server - https://phabricator.wikimedia.org/T161893#3150878 (10Gehel) The simplest possible test I can think of is `sudo puppet apply -e "service {'pos... [14:54:50] !log Deploy alter table to unify revision table across all the s3 wikis on db1015 - T159319 [14:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:42] ema, are you sure they were banned? I can still download some. [14:56:27] Urbanecm: in progress :) [14:56:28] it takes a little bit to execute the ban on all the affected hosts, I'm not sure if he's done yet [14:58:41] 06Operations, 06Office-IT, 07LDAP: Remove disabled users from internal mailing lists - https://phabricator.wikimedia.org/T161004#3150898 (10Aklapper) Please also check the discussion in {T100400}. 
[15:00:03] Urbanecm: done [15:02:05] (03Abandoned) 10Volans: Revert "Swift-proxy: use discovery everywhere for rewrites" [puppet] - 10https://gerrit.wikimedia.org/r/346149 (owner: 10Volans) [15:05:08] (03PS1) 10Hashar: contint: PHP packages cleanup [puppet] - 10https://gerrit.wikimedia.org/r/346165 [15:05:32] ema, it doesn't seem to be done. For example https://upload.wikimedia.org/wikipedia/commons/thumb/1/19/Ambox_currentevent.svg/48px-Ambox_currentevent.svg.png is still bad. [15:05:51] works for me. [15:06:02] paladox, which cluster do you use? [15:06:15] What do you mean cluster? [15:06:16] esama [15:06:18] *esams [15:06:23] yeh, europe [15:06:37] Urbanecm: that WFM from europe too [15:06:45] Thank you both. [15:06:45] Urbanecm: what's the value of the X-Cache response header you get with the bad response? [15:07:00] ema, the bad response isn't sent anymore, sorry [15:07:09] It works now... [15:07:10] \o/ [15:07:17] (03CR) 10Hashar: "That follow up https://gerrit.wikimedia.org/r/#/c/325877/ . A bit nicer since mediawiki::packages::php5 is now included explicitly. 
Onc" [puppet] - 10https://gerrit.wikimedia.org/r/346165 (owner: 10Hashar) [15:14:02] PROBLEM - Check Varnish expiry mailbox lag on cp1064 is CRITICAL: CRITICAL: expiry mailbox lag is 791658 [15:17:33] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 1008349 [15:24:12] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [15:25:12] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [15:25:52] PROBLEM - Check Varnish expiry mailbox lag on cp1050 is CRITICAL: CRITICAL: expiry mailbox lag is 564916 [15:27:54] 503s in upload ulsfo [15:28:02] PROBLEM - Check Varnish expiry mailbox lag on cp1062 is CRITICAL: CRITICAL: expiry mailbox lag is 733803 [15:28:55] hmmm [15:29:22] PROBLEM - Check Varnish expiry mailbox lag on cp1073 is CRITICAL: CRITICAL: expiry mailbox lag is 766144 [15:29:32] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp4015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:30:15] related to the ban or the change? [15:30:32] I don't think so [15:30:47] why the slew of mailbox issues on cp1? [15:31:13] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [15:31:15] seems likely related to the execution of the ban at least (perhaps that stalls out reaping space, or affects the pattern of it) [15:32:29] Varnish HTTP upload-backend - port 3128 on cp4015 is CRITICAL <- this seems a false positive? 
port 3128 is fine there [15:32:46] it may be intermittent [15:32:55] or we may have a ulsfo<->codfw/eqiad networking issue [15:32:55] <_joe_> ema: it's a timeout on a request [15:33:12] (which could cause that icinga port issue and the 5xx) [15:33:32] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp4013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:34:12] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [15:34:32] RECOVERY - Varnish HTTP upload-backend - port 3128 on cp4015 is OK: HTTP OK: HTTP/1.1 200 OK - 181 bytes in 8.971 second response time [15:34:32] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp4014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:35:23] RECOVERY - Varnish HTTP upload-backend - port 3128 on cp4014 is OK: HTTP OK: HTTP/1.1 200 OK - 180 bytes in 2.034 second response time [15:36:32] RECOVERY - Varnish HTTP upload-backend - port 3128 on cp4013 is OK: HTTP OK: HTTP/1.1 200 OK - 180 bytes in 2.416 second response time [15:37:49] network graphs looked reasonable in librenms [15:38:04] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Specific PNG thumbnail delivered as [text/html] instead of [image/png] and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3151086 (10stjn) Another one to the pile: https://upload.wikimedia.org/wikipe... [15:38:09] but all the cp1 mailboxes backlogging at the same time is suspect. 
maybe somehow that caused some indirect fallout [15:38:38] the 5xx rate is tapering off from initial peak, but still not back to normal [15:38:48] the lagging cp1s are not throwing 503s though [15:39:16] yeah but ulsfo's backend requests on misses ultimately end up requesting from those same daemons/storage [15:39:35] 06Operations, 07Puppet, 06Discovery, 06Maps, 03Interactive-Sprint: Puppet fails with "Could not find init script for 'postgresql@9.4-main'" on maps / labs server - https://phabricator.wikimedia.org/T161893#3151105 (10Gehel) It looks like puppet is using `systemctl('list-unit-files', '--type', 'service',... [15:40:03] bblack: we should see those requests with varnishlog on the cp1* machines tho, right? [15:40:11] yeah [15:40:15] varnishlog -q 'RespStatus ~ 503' gives no output [15:40:40] maybe they're just slow and ulsfo is giving up? [15:41:09] mmh [15:41:12] the 503 rate is fairly small [15:41:55] the initial peak was 16.6/s with GETs at ~9k/s [15:43:04] !log Updated email for "Lucie Kaffee" on wikitech from work address (wikimedia.de) to known volunteer address (upon request) [15:43:08] 6.2% of reqs should leave ulsfo towards further-back [15:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:52] (03CR) 10Alexandros Kosiaris: [C: 032] package_builder: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345836 (owner: 10Faidon Liambotis) [15:43:58] (03PS2) 10Alexandros Kosiaris: package_builder: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345836 (owner: 10Faidon Liambotis) [15:43:59] so very napkin-math estimate is that even the 16.6/s peak of 503s there on ulsfo only represented ~3% of the ulsfo->[codfw,eqiad,app] traffic failing [15:44:02] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] package_builder: remove precise support [puppet] - 10https://gerrit.wikimedia.org/r/345836 (owner: 10Faidon Liambotis) [15:45:32] bblack: cp4015 has a fairly large mailbox lag [15:46:51] -- 
FetchError http first read error: EOF [15:47:06] not too frequently, but they're happening ^ [15:48:17] yeah so if mailbox lag is generally getting worse than it has been, my two top hypotheses would be either the ttl/keep change affecting it negatively (I would've assumed positively, but who knows really) or the ban [15:48:40] but given the timing and functional correlation, I'd suspect the ban has some impact on the storage purging process [15:48:42] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [15:49:04] ripe-atlas-ulsfo perhaps related? [15:49:10] or maybe we are saturating network somehow? [15:50:10] (03PS4) 10Andrew Bogott: Keystone 2fa: Use the wikitech API rather than checking the db directly. [puppet] - 10https://gerrit.wikimedia.org/r/345231 [15:50:29] varnish-backend hitrate went down significantly [15:50:30] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=15&fullscreen&orgId=1&var-server=cp4015&var-datasource=ulsfo%20prometheus%2Fops [15:51:10] maybe unsetting Content-Type has side effects? [15:51:35] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Specific PNG thumbnail delivered as [text/html] instead of [image/png] and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3150362 (10Joe) @stjn I cannot reproduce your case and we should've fixed the... [15:51:44] (03CR) 10Andrew Bogott: [C: 032] Keystone 2fa: Use the wikitech API rather than checking the db directly. 
[puppet] - 10https://gerrit.wikimedia.org/r/345231 (owner: 10Andrew Bogott) [15:53:39] (03CR) 10Alexandros Kosiaris: "just did the manual cleanup on copper" [puppet] - 10https://gerrit.wikimedia.org/r/345836 (owner: 10Faidon Liambotis) [15:53:42] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [15:54:22] ema: it could also be that there's no functionally-bad side-effect, but that the affected (temporary bad html content type) objects were more numerous than we thought, and this purged a whole lot of things [15:54:53] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Specific PNG thumbnail delivered as [text/html] instead of [image/png] and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3151162 (10Urbanecm) No problem for me. Try to clear cache of your browser. T... [15:54:54] (03CR) 10Alexandros Kosiaris: [C: 032] url_downloader: convert to profile/role [puppet] - 10https://gerrit.wikimedia.org/r/344729 (owner: 10Dzahn) [15:55:13] bblack: the number of cached objects in varnish-be didn't change much so I'd say that's not the case [15:55:23] (03CR) 10Alexandros Kosiaris: [C: 032] "damn, needs manual rebase" [puppet] - 10https://gerrit.wikimedia.org/r/344729 (owner: 10Dzahn) [15:55:43] ok [15:56:30] but the lagging mailbox issue triggers additional backend connections so that might explain the additional network traffic [15:57:02] maybe check on whether the bans are still running anywhere or already done? (e.g. 
cumin "varnishadm ban.list") [15:57:21] and yeah maybe try reverting the CT unset [15:58:06] I'd think once they're done executing (scanning storage to purge objects) their impact would go away, if any [15:58:57] there's stuff like: [15:58:58] Present bans: [15:59:01] 1491231520.525840 4320833 C [15:59:06] but the CT bans are gone [15:59:17] pl [15:59:21] err, "ok" [15:59:30] I think there's always a base entry in there with C [15:59:56] on some hosts there are multiple C entries, not sure what that means [15:59:59] whereas e.g. when I looked earlier (as you were executing them initially), some caches would show: [16:00:02] 1491231520.153275 3433299 - obj.http.content-type == text/html; charset=UTF-8 [16:00:04] 1491231100.964850 848910 C [16:00:07] 1490739431.603784 0 C [16:01:16] (03PS7) 10Alexandros Kosiaris: url_downloader: convert to profile/role [puppet] - 10https://gerrit.wikimedia.org/r/344729 (owner: 10Dzahn) [16:03:05] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Specific PNG thumbnail delivered as [text/html] instead of [image/png] and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3151195 (10stjn) Yes, the same ERR_CONTENT_DECODING_FAILED even with disabled... [16:04:45] (03PS1) 10Andrew Bogott: keystone.conf: Define labs_osm_host [puppet] - 10https://gerrit.wikimedia.org/r/346171 [16:06:37] cmjohnson1: any update on those new hadoop servers? [16:06:44] (03CR) 10Andrew Bogott: [C: 032] keystone.conf: Define labs_osm_host [puppet] - 10https://gerrit.wikimedia.org/r/346171 (owner: 10Andrew Bogott) [16:06:52] (03CR) 10Alexandros Kosiaris: [C: 031] "https://puppet-compiler.wmflabs.org/6006/ says NOOP" [puppet] - 10https://gerrit.wikimedia.org/r/344729 (owner: 10Dzahn) [16:06:52] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:07:13] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:07:48] ulsfo 5xx just dropped back to zero now [16:07:54] yeah [16:07:59] odd! :) [16:09:05] hitrate up again https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=15&fullscreen&orgId=1&var-server=cp4015&var-datasource=ulsfo%20prometheus%2Fops [16:09:12] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:09:52] ottomata: they're still in the boxes... barring other misc tasks that take me away from them they're my primary focus this week. Do you have preferred racking instructions? [16:14:02] RECOVERY - Check Varnish expiry mailbox lag on cp1064 is OK: OK: expiry mailbox lag is 0 [16:15:12] PROBLEM - Check Varnish expiry mailbox lag on cp4015 is CRITICAL: CRITICAL: expiry mailbox lag is 567254 [16:16:03] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban, 15User-Elukey: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#3151240 (10Nuria) [16:18:02] PROBLEM - Check Varnish expiry mailbox lag on cp4014 is CRITICAL: CRITICAL: expiry mailbox lag is 521563 [16:18:23] bblack: and now cp4* mailboxes lagging, fun! [16:19:07] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Specific PNG thumbnail delivered as [text/html] instead of [image/png] and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3151248 (10Vachovec1) I can confirm ERR_CONTENT_DECODING_FAILED for https://u... [16:20:12] PROBLEM - puppet last run on cp1099 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:22:08] cmjohnson1: other than evenly distributed in different rows [16:22:08] nope [16:22:10] thank you! 
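The napkin-math estimate from the 15:41-15:43 exchange above (16.6 503s/s at peak, GETs at ~9k/s, 6.2% of requests leaving ulsfo) can be reproduced with a short sketch. The three inputs are the rough figures quoted in the channel, not measurements, and the function and variable names are made up for illustration.

```python
# Rough figures quoted in the channel; treated as approximate inputs only.
peak_503_per_sec = 16.6   # initial peak of 503s seen in ulsfo
get_per_sec = 9000.0      # "~9k/s" GETs
forwarded_fraction = 0.062  # "6.2% of reqs should leave ulsfo towards further-back"


def failing_share(errors_per_sec, requests_per_sec, fraction_forwarded):
    """Share of the forwarded (backend-bound) requests that failed."""
    forwarded_per_sec = requests_per_sec * fraction_forwarded
    return errors_per_sec / forwarded_per_sec


share = failing_share(peak_503_per_sec, get_per_sec, forwarded_fraction)
print(f"forwarded: {get_per_sec * forwarded_fraction:.0f} req/s, failing share: {share:.1%}")
```

With these inputs about 558 req/s leave ulsfo, so the peak works out to roughly 3% of that traffic, in line with the "~3%" figure given in the channel.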
[16:22:40] okay..that works [16:26:17] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: 6 Nodes for Kafka refresh/upgrade - https://phabricator.wikimedia.org/T161636#3151276 (10Nuria) p:05Triage>03Normal [16:28:37] 06Operations, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3151306 (10elukey) The issue with QUIT seems more subtle, namely only sometimes the RST happens after a QUIT... [16:28:43] (03PS1) 10Giuseppe Lavagetto: mediawiki::cron: general encapsulation for mediawiki cronjobs [puppet] - 10https://gerrit.wikimedia.org/r/346173 [16:30:57] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::cron: general encapsulation for mediawiki cronjobs [puppet] - 10https://gerrit.wikimedia.org/r/346173 (owner: 10Giuseppe Lavagetto) [16:32:22] PROBLEM - puppet last run on analytics1050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:34:52] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [16:38:16] 06Operations, 10Analytics-EventLogging, 06Analytics-Kanban, 10DBA, 13Patch-For-Review: Improve eventlogging replication procedure - https://phabricator.wikimedia.org/T124307#3151383 (10Nuria) [16:43:53] 06Operations, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3151420 (10elukey) Returning to the main timeout issue, it seems to me that the next step is trying to find... 
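The `ban.list` output pasted in the channel between 15:58 and 16:00 can be checked mechanically for bans that are still pending. This is a minimal sketch that only parses the three-column lines as they appear in the log; it assumes, as the discussion suggests but does not confirm, that a `C` in the third column marks a ban that is no longer active. The helper name and regular expression are illustrative, not part of any Varnish tooling.

```python
import re

# Sample lines copied from the channel log above.
SAMPLE = """\
Present bans:
1491231520.153275 3433299 - obj.http.content-type == text/html; charset=UTF-8
1491231100.964850 848910 C
1490739431.603784 0 C
"""

# timestamp, reference count, one-character flag, optional ban expression
BAN_LINE = re.compile(
    r"^(?P<ts>\d+\.\d+)\s+(?P<refs>\d+)\s+(?P<flag>\S)\s*(?P<spec>.*)$"
)


def pending_bans(text):
    """Return ban expressions whose flag is not 'C' (assumed: C = completed)."""
    pending = []
    for line in text.splitlines():
        match = BAN_LINE.match(line.strip())
        if match and match.group("flag") != "C":
            pending.append(match.group("spec"))
    return pending


print(pending_bans(SAMPLE))
```

An empty result on every cache host would correspond to the "CT bans are gone" observation at 15:59.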
[16:47:33] PROBLEM - Check Varnish expiry mailbox lag on cp4013 is CRITICAL: CRITICAL: expiry mailbox lag is 552977 [16:48:12] RECOVERY - puppet last run on cp1099 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [16:52:02] PROBLEM - Host analytics1030 is DOWN: PING CRITICAL - Packet loss = 100% [16:52:12] PROBLEM - salt-minion processes on analytics1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [16:52:22] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 49614.215709 Seconds [16:52:22] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 49615.110857 Seconds [16:52:22] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 49615.966464 Seconds [16:53:13] 06Operations: Weak digest algorithm (SHA1) used to sign InRelease on apt.wikimedia.org - https://phabricator.wikimedia.org/T132325#2194482 (10faidon) First off, I'm surprised that sid's apt worked with the jessie-wikimedia suite, since jessie-wikimedia is signed with a weak DSA key that shouldn't be accepted by... [16:55:22] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds [16:55:22] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds [16:55:22] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 0.0 Seconds [16:55:42] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [16:57:21] 06Operations, 06DC-Ops, 06Labs: Move labstore1002 and labstore1002-array1 and labstore1002-array2 to different rack (currently in C3) - https://phabricator.wikimedia.org/T158913#3151459 (10madhuvishy) Hi @Cmjohnson, apologies for the delay here, we were working through the possibilities of what the next step... 
[16:57:23] setting downtime for an1030 [16:58:50] elukey: powering it back on [16:59:37] ah nice cmjohnson1, did you manage to fix it? [17:00:05] gehel: Dear anthropoid, the time has come. Please deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170403T1700). [17:00:12] PROBLEM - Check Varnish expiry mailbox lag on cp4006 is CRITICAL: CRITICAL: expiry mailbox lag is 558262 [17:00:42] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [17:01:22] RECOVERY - puppet last run on analytics1050 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [17:01:24] !log gehel@tin Started deploy [wdqs/wdqs@d7c367a]: (no justification provided) [17:01:30] (03PS2) 10Madhuvishy: tools: Deprecate precise_reminder role and clean up related script [puppet] - 10https://gerrit.wikimedia.org/r/342658 (https://phabricator.wikimedia.org/T149214) [17:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:54] !log gehel@tin Finished deploy [wdqs/wdqs@d7c367a]: (no justification provided) (duration: 01m 29s) [17:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:43] elukey: it looks like there is going to be a problem that will required Dell... [17:03:55] SMalyshev: wdqs deployment completed, tests looking good... 
[17:04:09] (03Abandoned) 10Madhuvishy: tools: Deprecate precise_reminder role and clean up related script [puppet] - 10https://gerrit.wikimedia.org/r/342658 (https://phabricator.wikimedia.org/T149214) (owner: 10Madhuvishy) [17:06:06] cmjohnson1: ack, thanks :) [17:06:07] (03PS2) 10Giuseppe Lavagetto: mediawiki::cron: general encapsulation for mediawiki cronjobs [puppet] - 10https://gerrit.wikimedia.org/r/346173 [17:07:04] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::cron: general encapsulation for mediawiki cronjobs [puppet] - 10https://gerrit.wikimedia.org/r/346173 (owner: 10Giuseppe Lavagetto) [17:07:08] (03PS1) 10Madhuvishy: nfsclient: Enable lookupcache by default for all nfs client instances [puppet] - 10https://gerrit.wikimedia.org/r/346177 (https://phabricator.wikimedia.org/T136712) [17:07:23] 06Operations, 10DBA: Cleanup or remove mysql puppet module; repurpose mariadb module to cover misc use cases - https://phabricator.wikimedia.org/T162070#3151523 (10faidon) [17:08:15] gehel: great, thanks! [17:08:24] SMalyshev: you're welcome! [17:16:22] PROBLEM - puppet last run on mw1205 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:27:05] 06Operations, 07HHVM: HHVM 3.18 deadlocks after 4-6 hours (stuck in in HPHP::Treadmill::getAgeOldestRequest() ) - https://phabricator.wikimedia.org/T161684#3151575 (10MoritzMuehlenhoff) mw1261 is stable with stat_cache=false for ten hours of production traffic now. I've reported this back to upstream along wit... [17:27:44] (03Draft1) 10Paladox: Gerrit: Disable md5 in ssh [puppet] - 10https://gerrit.wikimedia.org/r/346180 [17:27:59] (03PS2) 10Paladox: Gerrit: Disable md5 in ssh [puppet] - 10https://gerrit.wikimedia.org/r/346180 [17:28:15] mutante ^^ :) [17:29:23] (03CR) 10Paladox: "I have no idea what eddsa is called in mac." 
[puppet] - 10https://gerrit.wikimedia.org/r/346180 (owner: 10Paladox) [17:30:12] RECOVERY - Check Varnish expiry mailbox lag on cp4006 is OK: OK: expiry mailbox lag is 0 [17:32:49] (03CR) 10Dzahn: "disabling MD5 is good. is it enabled now though? the eddsa host key thing should be unrelated to the MAC choice" [puppet] - 10https://gerrit.wikimedia.org/r/346180 (owner: 10Paladox) [17:33:47] (03CR) 10Paladox: "> disabling MD5 is good. is it enabled now though? the eddsa host key" [puppet] - 10https://gerrit.wikimedia.org/r/346180 (owner: 10Paladox) [17:38:52] (03CR) 10Paladox: "Here is the output of ssh -vvv to gerrit.wikimedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/346180 (owner: 10Paladox) [17:39:16] (03PS3) 10Giuseppe Lavagetto: mediawiki::cron: general encapsulation for mediawiki cronjobs [puppet] - 10https://gerrit.wikimedia.org/r/346173 [17:40:08] (03CR) 10Thcipriani: [C: 031] "So the error output in the puppet compiler seems to be complaining about conftool. I don't think this is being caused by this change since" [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: 10MarkTraceur) [17:44:22] RECOVERY - puppet last run on mw1205 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [17:56:04] 06Operations, 10Ops-Access-Requests: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3151609 (10BBlack) [17:56:58] hey, that's me! [17:57:21] (03PS1) 10BBlack: Add ayounsi shell account in ops [puppet] - 10https://gerrit.wikimedia.org/r/346182 (https://phabricator.wikimedia.org/T162073) [17:57:32] RECOVERY - Check Varnish expiry mailbox lag on cp4013 is OK: OK: expiry mailbox lag is 40 [17:57:42] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [17:57:57] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "good work, see a few inline comments." 
(036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/342248 (https://phabricator.wikimedia.org/T147718) (owner: 10Gehel) [17:58:02] RECOVERY - Check Varnish expiry mailbox lag on cp4014 is OK: OK: expiry mailbox lag is 34808 [17:58:08] 06Operations, 10Ops-Access-Requests, 10Traffic: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3151626 (10BBlack) [17:58:27] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/346182 (https://phabricator.wikimedia.org/T162073) (owner: 10BBlack) [17:58:44] (03CR) 10BBlack: [C: 032] Add ayounsi shell account in ops [puppet] - 10https://gerrit.wikimedia.org/r/346182 (https://phabricator.wikimedia.org/T162073) (owner: 10BBlack) [18:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170403T1800). Please do the needful. [18:00:05] DatGuy: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:01:07] 06Operations, 10Ops-Access-Requests, 10Traffic: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3151609 (10Dzahn) added to ops mailing list [18:02:12] Hello, I can SWAT. [18:02:15] DatGuy: ping? [18:02:19] 06Operations, 10Ops-Access-Requests, 10Traffic: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3151609 (10faidon) Added to all 34 network devices (cr, asw/csw, msw, mr, pfw). 
[18:02:42] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [18:03:26] 06Operations, 10Ops-Access-Requests, 10Traffic: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3151651 (10BBlack) [18:03:35] 06Operations, 10Ops-Access-Requests, 10Traffic: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3151609 (10BBlack) [18:05:13] RECOVERY - Check Varnish expiry mailbox lag on cp4015 is OK: OK: expiry mailbox lag is 183 [18:06:29] (03PS1) 10Dzahn: ssh: avoid hardcoded hostname for yubiauth, add to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/346183 [18:07:35] (03CR) 10jerkins-bot: [V: 04-1] ssh: avoid hardcoded hostname for yubiauth, add to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/346183 (owner: 10Dzahn) [18:07:46] (03CR) 10Dzahn: [C: 04-1] ssh: avoid hardcoded hostname for yubiauth, add to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/346183 (owner: 10Dzahn) [18:07:48] 06Operations, 10Ops-Access-Requests, 10Traffic: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3151661 (10BBlack) [18:08:16] (03CR) 10Dzahn: [C: 04-2] ssh: avoid hardcoded hostname for yubiauth, add to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/346183 (owner: 10Dzahn) [18:09:03] (03PS1) 10Dzahn: icinga: allow command execution for Ayounsi [puppet] - 10https://gerrit.wikimedia.org/r/346184 (https://phabricator.wikimedia.org/T162073) [18:09:58] (03CR) 10Dereckson: "Initially planned for 2017-04-03 morning SWAT, but change author weren't available." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346043 (https://phabricator.wikimedia.org/T161804) (owner: 10DatGuy) [18:10:52] (03CR) 10Dereckson: "Initially planned for 2017-04-03 morning SWAT, but change author weren't available." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/346044 (https://phabricator.wikimedia.org/T161593) (owner: 10DatGuy) [18:13:25] 06Operations, 10Ops-Access-Requests, 10Traffic, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3151686 (10Dzahn) added to root@ mail alias (prepare for incoming wave of mails :p) [18:14:25] ah crikey [18:14:30] Dereckson, thought it was tomorrow [18:15:10] is it too late? [18:15:18] No, we can deploy them now :) [18:15:25] alright [18:15:31] Any preference order? [18:16:15] you mean order of merges? [18:17:18] yes [18:17:21] not really [18:17:25] (03PS3) 10Dereckson: Convert reference lists to 'responsive' on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346043 (https://phabricator.wikimedia.org/T161804) (owner: 10DatGuy) [18:17:32] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346043 (https://phabricator.wikimedia.org/T161804) (owner: 10DatGuy) [18:18:36] DatGuy: you've already the X-Wikimedia-Debug extension installed? [18:18:41] yep [18:18:44] it's mw1002 right?
[18:18:53] mwdebug1002 indeed [18:19:05] (but we're still waiting zuul to find a free slot to run tests) [18:20:33] 06Operations, 10Ops-Access-Requests, 10Traffic, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3151689 (10Dzahn) added Icinga contact in private repo (with just email notification method for now, no phone number / paging just yet) [18:22:04] (03CR) 10Dzahn: [C: 031] "contact exists now (private repo)" [puppet] - 10https://gerrit.wikimedia.org/r/346184 (https://phabricator.wikimedia.org/T162073) (owner: 10Dzahn) [18:22:58] 06Operations, 10Ops-Access-Requests, 10Traffic, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3151694 (10BBlack) Added to other email aliases in private repo as well: dns-admin, peering, ripe-updates [18:23:31] (03Merged) 10jenkins-bot: Convert reference lists to 'responsive' on hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346043 (https://phabricator.wikimedia.org/T161804) (owner: 10DatGuy) [18:23:40] mutante: hold that commit for icinga cmd access, can test his +2 on it? [18:23:54] alright, checking now [18:24:21] DatGuy: not yet on mwdebug1002 (but will in 20 seconds) [18:24:39] DatGuy: ok, live now [18:24:40] bblack: sure, good ida [18:25:23] (03PS2) 10Dereckson: Configure Babel for elwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346044 (https://phabricator.wikimedia.org/T161593) (owner: 10DatGuy) [18:25:30] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346044 (https://phabricator.wikimedia.org/T161593) (owner: 10DatGuy) [18:27:05] hewiki looks good, but only checked one page [18:27:33] a page with references? 
[18:27:56] (03Merged) 10jenkins-bot: Configure Babel for elwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346044 (https://phabricator.wikimedia.org/T161593) (owner: 10DatGuy) [18:28:04] yep [18:28:26] So that's good :) [18:28:39] Syncing [18:29:14] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Convert reference lists to 'responsive' on hewiki (T161804) (duration: 00m 52s) [18:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:21] T161804: Convert reference lists over to `responsive` on hewiki - https://phabricator.wikimedia.org/T161804 [18:29:36] (03CR) 10Ayounsi: [C: 032] icinga: allow command execution for Ayounsi [puppet] - 10https://gerrit.wikimedia.org/r/346184 (https://phabricator.wikimedia.org/T162073) (owner: 10Dzahn) [18:30:32] DatGuy: Babel change live on mwdebug1002.eqiad.wmnet [18:31:40] (03PS2) 10Ayounsi: icinga: allow command execution for Ayounsi [puppet] - 10https://gerrit.wikimedia.org/r/346184 (https://phabricator.wikimedia.org/T162073) (owner: 10Dzahn) [18:32:52] http://imgur.com/a/Txu2Y babel looks good on elwikisource [18:35:32] ping Dereckson [18:35:45] pong [18:35:55] pang [18:36:02] babel good to go [18:36:10] :) [18:36:11] Yes I've seen it, will sync in a few moments [18:36:23] alright, cheers [18:37:01] syncing [18:37:39] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Configure Babel for elwikisource (T161593) (duration: 00m 44s) [18:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:46] T161593: Configure extension:Babel for el.wikisource - https://phabricator.wikimedia.org/T161593 [18:38:24] great, thanks for stilling merging the changes even though I was absent [18:38:38] Thanks for the changes :) [18:39:11] I've notified the communities it's live. 
[18:41:14] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [18:41:59] also, Dereckson, may I also make gerrit patches for https://phabricator.wikimedia.org/T161529 or is it only people with specific access? [18:42:29] DatGut you can create patches for it [18:42:38] i helped to create a wikipedia before. [18:42:52] I forgot the name of the wikipedia though as it was last year [18:44:49] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [18:45:51] alright, thanks [18:46:09] PROBLEM - HP RAID on dbstore2001 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [18:46:55] (03PS2) 10Rush: nfsclient: Enable lookupcache by default for all nfs client instances [puppet] - 10https://gerrit.wikimedia.org/r/346177 (https://phabricator.wikimedia.org/T136712) (owner: 10Madhuvishy) [18:46:57] (03CR) 10Rush: [C: 031] nfsclient: Enable lookupcache by default for all nfs client instances [puppet] - 10https://gerrit.wikimedia.org/r/346177 (https://phabricator.wikimedia.org/T136712) (owner: 10Madhuvishy) [18:47:49] RECOVERY - HP RAID on dbstore2001 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, Controller, Battery/Capacitor [18:48:34] (03CR) 10Madhuvishy: [C: 032] nfsclient: Enable lookupcache by default for all nfs client instances [puppet] - 10https://gerrit.wikimedia.org/r/346177 (https://phabricator.wikimedia.org/T136712) (owner: 10Madhuvishy) [18:48:41] paladox: tcy? ady? [18:48:46] jam? [18:48:49] tcy i think [18:48:50] 06Operations, 10Ops-Access-Requests, 10Traffic, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3151772 (10BBlack) [18:50:39] Did gerrit just become faster? 
[18:50:45] Let's hope this one is actionable, two Wikipedia requests wasn't, one as language engineering wasn't happy with the translation progress, one as there wasn't enough community (according an en. block and a CU on Incubator, sockpuppets was used by the unique active contributor to give the impression they were 3) [18:50:47] It looks to be very fast now. [18:51:10] I'll see to create pa.wikisource Thursday [18:56:18] PROBLEM - puppet last run on elastic1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:56:48] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [18:56:55] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Specific PNG thumbnail delivered as [text/html] instead of [image/png] and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3151794 (10ema) Our understanding of the problem so far is that some of our s... [18:59:44] (03PS1) 10Andrew Bogott: toolschecker: Test ldap by checking ou=groups instead of ou=projects [puppet] - 10https://gerrit.wikimedia.org/r/346187 (https://phabricator.wikimedia.org/T126758) [19:01:32] (03CR) 10Andrew Bogott: [C: 032] toolschecker: Test ldap by checking ou=groups instead of ou=projects [puppet] - 10https://gerrit.wikimedia.org/r/346187 (https://phabricator.wikimedia.org/T126758) (owner: 10Andrew Bogott) [19:01:48] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [19:07:58] RECOVERY - Check Varnish expiry mailbox lag on cp1062 is OK: OK: expiry mailbox lag is 0 [19:08:28] PROBLEM - puppet last run on db1023 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:10:25] jouncebot: next [19:10:25] In 0 hour(s) and 49 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170403T2000) [19:10:29] jouncebot: refresh [19:10:32] I refreshed my knowledge about deployments. [19:10:46] (03PS1) 10Andrew Bogott: toolschecker: The group is 'project-testlabs,' not 'testlabs' [puppet] - 10https://gerrit.wikimedia.org/r/346189 (https://phabricator.wikimedia.org/T126758) [19:12:02] (03CR) 10Andrew Bogott: [C: 032] toolschecker: The group is 'project-testlabs,' not 'testlabs' [puppet] - 10https://gerrit.wikimedia.org/r/346189 (https://phabricator.wikimedia.org/T126758) (owner: 10Andrew Bogott) [19:13:48] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 21 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [19:15:50] !log phabricator/ops: adding ayounsi to WMF-NDA (project 61) and acl*operations-team (project 29) (T162073) [19:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:57] T162073: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073 [19:16:28] 06Operations, 10Wikimedia-Mailing-lists: mailman issue for ops team? 
- https://phabricator.wikimedia.org/T162080#3151923 (10Legoktm) [19:16:41] I am going to sync some project logos [19:16:44] !log in testlabs, deleted ou=projects,dc=wikimedia,dc=org and ou=roles,dc=wikimedia,dc=org as per T126758 [19:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:51] T126758: Clean up after ldap->mysql keystone migration - https://phabricator.wikimedia.org/T126758 [19:18:54] !log hashar@tin Synchronized static/images/project-logos: Optimize a few project logos - T161999 (duration: 00m 44s) [19:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:02] T161999: Make sure all logos are optimalized - https://phabricator.wikimedia.org/T161999 [19:21:26] !log Finished deployment of project-logos optimization for T161999 / https://gerrit.wikimedia.org/r/#/c/346057/ . And purged the related logos [19:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:38] 06Operations, 10Ops-Access-Requests, 10Traffic, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3151981 (10Dzahn) [x] phabricator permissions to see NDA and Ops restricted tickets I did the same steps that were done by @Aklapper in T144496#2601909. - https... [19:23:09] 06Operations, 10Ops-Access-Requests, 10Traffic, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3151984 (10Dzahn) [19:23:13] 06Operations, 10Traffic, 10media-storage, 13Patch-For-Review, 15User-Urbanecm: Specific PNG thumbnail delivered as [text/html] instead of [image/png] and hence not rendered in browser - https://phabricator.wikimedia.org/T162035#3151985 (10ema) There is probably another type of bug responsible for the ERR... 
[19:24:18] RECOVERY - puppet last run on elastic1020 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [19:28:38] (03PS3) 10Andrew Bogott: Add skin, language, and variant to user_properties_anon [puppet] - 10https://gerrit.wikimedia.org/r/344302 (https://phabricator.wikimedia.org/T152043) (owner: 10Reedy) [19:28:48] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [19:32:42] (03CR) 10Andrew Bogott: [C: 032] Add skin, language, and variant to user_properties_anon [puppet] - 10https://gerrit.wikimedia.org/r/344302 (https://phabricator.wikimedia.org/T152043) (owner: 10Reedy) [19:36:28] RECOVERY - puppet last run on db1023 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [19:54:37] (03CR) 10Andrew Bogott: "An important thing to keep in mind is that ldaplist doesn't currently work correectly for many searches, due to query limits." [puppet] - 10https://gerrit.wikimedia.org/r/337842 (https://phabricator.wikimedia.org/T114063) (owner: 10Hashar) [19:55:48] RECOVERY - Check Varnish expiry mailbox lag on cp1050 is OK: OK: expiry mailbox lag is 0 [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170403T2000). Please do the needful. [20:00:29] no parsoid deploy today [20:03:28] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:06:30] (03PS1) 10Subramanya Sastry: Delink new parsoid-vd test runs from updates to parsoid git repo [puppet] - 10https://gerrit.wikimedia.org/r/346196 [20:06:35] 06Operations, 07HHVM: HHVM 3.18 deadlocks after 4-6 hours (stuck in in HPHP::Treadmill::getAgeOldestRequest() ) - https://phabricator.wikimedia.org/T161684#3152168 (10hashar) {T89912} has some related clues, specially a debugging session T89912#1286874 which mentions concurrent_hash_map and the StatCache holdi... [20:07:19] 06Operations, 07HHVM: HHVM lock-ups - https://phabricator.wikimedia.org/T89912#1048540 (10hashar) HHVM 3.18 has a similar deadlock that happens after just a few hours of live traffic. T161684 Most probably the same deadlock. [20:08:45] 06Operations, 07HHVM: HHVM 3.18 deadlocks after 4-6 hours (stuck in in HPHP::Treadmill::getAgeOldestRequest() ) - https://phabricator.wikimedia.org/T161684#3139527 (10hashar) And I mentioned it somewhere else, the statcache got enabled via T75706. At the time that has cut system CPU by half. [20:10:47] (03CR) 10Subramanya Sastry: "https://github.com/wikimedia/integration-visualdiff/blob/e5ce302e8ab51303d8d2fc49f6463815e9c3adee/testreduce/client.scripts.js#L56-L60 is " [puppet] - 10https://gerrit.wikimedia.org/r/346196 (owner: 10Subramanya Sastry) [20:13:19] 06Operations, 07HHVM: HHVM 3.18 deadlocks after 4-6 hours (stuck in in HPHP::Treadmill::getAgeOldestRequest() ) - https://phabricator.wikimedia.org/T161684#3152193 (10MoritzMuehlenhoff) The current performance loss seems less significant though (e.g. compare mw1261 with HHVM 3.18 and stat_cache disabled to mw1... 
[20:15:41] (03CR) 10Ottomata: [C: 031] Create a separate sysctl configuration for setting conntrack settings [puppet] - 10https://gerrit.wikimedia.org/r/319071 (owner: 10Muehlenhoff) [20:19:28] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [20:24:37] 06Operations, 06Labs: Investigate alternative RAID strategies for labstore1001/2 - https://phabricator.wikimedia.org/T162090#3152197 (10madhuvishy) [20:26:17] 06Operations, 07HHVM: HHVM 3.18 deadlocks after 4-6 hours (stuck in in HPHP::Treadmill::getAgeOldestRequest() ) - https://phabricator.wikimedia.org/T161684#3139527 (10Joe) >>! In T161684#3152193, @MoritzMuehlenhoff wrote: > The current performance loss seems less significant though (e.g. compare mw1261 with HH... [20:27:18] RECOVERY - salt-minion processes on analytics1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:27:18] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [20:29:25] (03CR) 10Chad: "Is this an actual problem we're solving?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/346180 (owner: 10Paladox) [20:30:21] (03PS3) 10Paladox: Gerrit: Disable md5 in ssh [puppet] - 10https://gerrit.wikimedia.org/r/346180 [20:30:51] 06Operations, 10Ops-Access-Requests, 10Traffic, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3152222 (10ayounsi) @Muehlenhoff, here is my public GPG key for pwstore. ``` -----BEGIN PGP PUBLIC KEY BLOCK----- mQGiBEtGU7gRBADRV1Z96fsxR6riZOD1bL3PVhyKntVakX... 
[20:31:06] (03CR) 10Paladox: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/346180 (owner: 10Paladox) [20:31:29] (03CR) 10Paladox: Gerrit: Disable md5 in ssh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/346180 (owner: 10Paladox) [20:34:28] PROBLEM - puppet last run on mw1186 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:42:18] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [20:44:26] !log bsitzmann@tin Started deploy [mobileapps/deploy@20ab197]: Update mobileapps to fdd4e31 [20:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:30] (03CR) 10Mobrovac: [C: 031] "+1 for this change, but the referenced function should really take into account the fact that it may be trying to open a file that doesn't" [puppet] - 10https://gerrit.wikimedia.org/r/346196 (owner: 10Subramanya Sastry) [20:47:31] !log bsitzmann@tin Finished deploy [mobileapps/deploy@20ab197]: Update mobileapps to fdd4e31 (duration: 03m 05s) [20:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:12] (03CR) 10Subramanya Sastry: "I initially added a puppet config to initialize the file on ruthenium ... but then backed it out since it is overly defensive code for wha" [puppet] - 10https://gerrit.wikimedia.org/r/346196 (owner: 10Subramanya Sastry) [20:52:13] 06Operations, 06Labs, 13Patch-For-Review: Instance creation fails before first puppet run around 1% of the time - https://phabricator.wikimedia.org/T160908#3152276 (10Andrew) Oddly, labs instances seem to be getting their dhcp leases from install1001: lease { interface "eth0"; fixed-address 10.68.21.59;... 
[20:55:48] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [21:00:04] dapatrick, bawolff, and Reedy: Dear anthropoid, the time has come. Please deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170403T2100). [21:01:28] PROBLEM - puppet last run on rdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:03:28] RECOVERY - puppet last run on mw1186 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [21:07:09] (03PS3) 10Ottomata: Improvements to eventlogging_sync.sh script [puppet] - 10https://gerrit.wikimedia.org/r/345646 (https://phabricator.wikimedia.org/T124307) [21:09:40] (03CR) 10Ottomata: "Ook, I've added a -s option to make testing this easier." [puppet] - 10https://gerrit.wikimedia.org/r/345646 (https://phabricator.wikimedia.org/T124307) (owner: 10Ottomata) [21:10:48] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [21:22:48] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [21:27:38] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 374 [21:27:48] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [21:28:08] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [21:29:28] RECOVERY - puppet last run on rdb1001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [21:30:09] chasemp andrewbogott ^^ [21:30:25] 06Operations, 10Stashbot: [IDEA] Backup 
bot for morebots - https://phabricator.wikimedia.org/T148694#3152351 (10Zppix) Just cleaning up some tasks ive authored, are we done with this task are we still discussing here? [21:30:29] tx paladox [21:30:39] your welcome :) [21:31:12] andrewbogott: we just met our 7 limit [21:31:25] andrewbogott: I'm going to clean house to keep the metrics going [21:31:37] (03PS4) 10Madhuvishy: WIP tools: job to copytruncate logs in place [puppet] - 10https://gerrit.wikimedia.org/r/326153 (owner: 10Rush) [21:37:08] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [21:38:18] PROBLEM - puppet last run on conf1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:38:36] (03CR) 10Dzahn: [C: 031] Gerrit: Disable md5 in ssh [puppet] - 10https://gerrit.wikimedia.org/r/346180 (owner: 10Paladox) [21:57:18] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:59:39] "“We’ve decided to reorganise our operations further and therefore, as of 21 March 2014, services provided from all our European websites will be provided by Yahoo! EMEA, in Ireland." [21:59:40] woops [21:59:44] wrong place [22:00:08] PROBLEM - Host lvs2002 is DOWN: PING CRITICAL - Packet loss = 100% [22:00:38] PROBLEM - puppet last run on db2068 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:00:48] PROBLEM - puppet last run on mw2232 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:02:05] (03CR) 10Paladox: "@BBlack hi could you review please?" 
[puppet] - 10https://gerrit.wikimedia.org/r/346180 (owner: 10Paladox) [22:04:48] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [22:06:18] RECOVERY - puppet last run on conf1002 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [22:06:24] !log power cycling lvs2002, it was down and console showed nothing [22:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:28] RECOVERY - Check Varnish expiry mailbox lag on cp1073 is OK: OK: expiry mailbox lag is 294349 [22:09:39] RECOVERY - Host lvs2002 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms [22:09:48] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [22:16:16] (03CR) 10Volans: "Approach looks good, just few comments inline, nothing really blocking" (0318 comments) [puppet] - 10https://gerrit.wikimedia.org/r/346118 (owner: 10Giuseppe Lavagetto) [22:16:55] elukey: the "few" was for you ;) ^^^ [22:24:12] 06Operations, 10Stashbot: [IDEA] Backup bot for morebots - https://phabricator.wikimedia.org/T148694#3152537 (10Krinkle) 05Open>03declined [22:28:38] RECOVERY - puppet last run on db2068 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [22:29:48] RECOVERY - puppet last run on mw2232 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [22:34:34] (03PS1) 10Subramanya Sastry: ruthenium: increase parsoid-vd clients from 4 to 6 [puppet] - 10https://gerrit.wikimedia.org/r/346209 [22:35:22] !log volans@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,name=ms-fe1005.eqiad.wmnet [22:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:28] PROBLEM - puppet last run on mw1207 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:37:45] !log volans@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,name=ms-fe1005.eqiad.wmnet [22:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:07] !log completed restart of swift-proxies in eqiad, ms-fe1005 was missing due to swiftrepl stuck/running [22:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:48] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [22:41:56] 06Operations, 10ops-codfw, 10Traffic: lvs2002 random shut down - https://phabricator.wikimedia.org/T162099#3152568 (10Dzahn) [22:46:48] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [22:48:15] 06Operations, 10Wikimedia-Mailing-lists: Reset admin password for wikimania-program mailing list - https://phabricator.wikimedia.org/T162080#3152584 (10Aklapper) p:05Triage>03High [22:57:23] 06Operations, 10Wikimedia-Mailing-lists: Reset admin password for wikimania-program mailing list - https://phabricator.wikimedia.org/T162080#3151892 (10Dzahn) @eyoung Hi, you should have received an automatic mail from mailman with a new randomly generated password that you can use to login. Best, Daniel --... [23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170403T2300). Please do the needful. [23:00:04] TimStarling and Niharika: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[23:03:52] (03PS2) 10Dzahn: ssh: avoid hardcoded hostname for yubiauth, add to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/346183 [23:04:05] I can SWAT. TimStarling Niharika ping me when you're available. [23:04:14] thcipriani: Hi! I'm here. [23:05:14] (03PS2) 10Thcipriani: Test LoginNotify on Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345726 (https://phabricator.wikimedia.org/T158878) (owner: 10Niharika29) [23:05:21] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345726 (https://phabricator.wikimedia.org/T158878) (owner: 10Niharika29) [23:06:22] Niharika: hello :) so after this merges I'll sync it out to production to make sure everything merged is deployed, but it will deploy to beta cluster on the next beta-code-update run [23:06:28] RECOVERY - puppet last run on mw1207 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [23:06:48] (03Merged) 10jenkins-bot: Test LoginNotify on Beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345726 (https://phabricator.wikimedia.org/T158878) (owner: 10Niharika29) [23:06:53] which...I just realized is stuck :) [23:06:57] will fix after swat [23:07:17] thcipriani: Okay! [23:09:37] thcipriani: I'm here now [23:10:30] !log thcipriani@tin Synchronized wmf-config: SWAT: [[gerrit:345726|Test LoginNotify on Beta cluster]] T158878 (duration: 00m 46s) [23:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:37] T158878: Test LoginNotify Extension on Beta Cluster - https://phabricator.wikimedia.org/T158878 [23:10:50] Niharika: ^ sync'd live, will fix beta deploy shortly :) [23:11:05] thcipriani: Thank you. [23:11:21] hi Niharika :) [23:11:32] Hey saper. How're you doing? [23:11:57] TimStarling: hello! Looks like we need to do a full scap for this one so I'll get that cracking. 
[23:12:04] yes [23:12:37] (03PS6) 10Thcipriani: Deploy ParserMigration extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344276 (https://phabricator.wikimedia.org/T141586) (owner: 10Tim Starling) [23:12:58] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344276 (https://phabricator.wikimedia.org/T141586) (owner: 10Tim Starling) [23:13:48] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [23:13:59] (03Merged) 10jenkins-bot: Deploy ParserMigration extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344276 (https://phabricator.wikimedia.org/T141586) (owner: 10Tim Starling) [23:14:28] PROBLEM - puppet last run on rdb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:15:46] TimStarling: ok, so I think I'm going to run a full scap without the commonsettings.php change, then after the full scap we can test with the commonsettings.php change on mwdebug1002 and then go all in. Sound sane? [23:16:07] sounds good [23:16:39] sounds commendably cautious [23:17:11] (03CR) 10Mobrovac: "The CPUs are at 80% though and each diff run seems to take at least 10% more. Do we really want to do it? Perhaps try with 5 first?" 
[puppet] - 10https://gerrit.wikimedia.org/r/346209 (owner: 10Subramanya Sastry) [23:18:34] (03PS3) 10Dzahn: ssh: avoid hardcoded hostname for yubiauth, add to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/346183 [23:18:48] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [23:19:19] !log thcipriani@tin Started scap: SWAT: [[gerrit:344276|Deploy ParserMigration extension]] T141586 (l10nupdate only) [23:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:25] T141586: Deploy ParserMigration extension - https://phabricator.wikimedia.org/T141586 [23:23:58] (03CR) 10Subramanya Sastry: "It is because of parsoid-rt and parsoid-vd kicking off at the same time. Plus testreduce picks the worst failures to retry first and crash" [puppet] - 10https://gerrit.wikimedia.org/r/346209 (owner: 10Subramanya Sastry) [23:25:24] 06Operations, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3152656 (10aaron) In $wmgRedisQueueBaseConfig in wmf-config/jobqueue.php I see the timeout is currently 0.3.... 
[23:40:48] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [23:41:43] !log thcipriani@tin Finished scap: SWAT: [[gerrit:344276|Deploy ParserMigration extension]] T141586 (l10nupdate only) (duration: 22m 24s) [23:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:50] T141586: Deploy ParserMigration extension - https://phabricator.wikimedia.org/T141586 [23:42:28] RECOVERY - puppet last run on rdb1003 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [23:43:36] ^ TimStarling ok so l10n should be up-to-date, I pulled the updated commonsettings.php on mwdebug1002, check please [23:45:48] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [23:46:34] 06Operations, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3152678 (10Dereckson) [23:51:08] (03CR) 10jenkins-bot: Deploy ParserMigration extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344276 (https://phabricator.wikimedia.org/T141586) (owner: 10Tim Starling) [23:52:18] thcipriani: looks fine [23:52:29] TimStarling: ok, going live everywhere [23:53:00] I enabled the user preference and tested the new editor [23:54:01] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:344276|Deploy ParserMigration extension]] T141586 (for real) (duration: 00m 44s) [23:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:08] T141586: Deploy ParserMigration extension - https://phabricator.wikimedia.org/T141586 [23:54:35] looks good [23:55:24] (03PS4) 10Dzahn: ssh: avoid hardcoded hostname for yubiauth, add to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/346183 [23:57:07] cool, thanks for checking, logs seem ok 
too :) [23:57:51] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1086" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346144 (owner: 10Marostegui) [23:58:47] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1015 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346160 (https://phabricator.wikimedia.org/T159319) (owner: 10Marostegui) [23:59:38] (03CR) 10jenkins-bot: Optimalize all not-optimalized logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/346057 (https://phabricator.wikimedia.org/T161999) (owner: 10Urbanecm)