[00:21:51] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1971 bytes in 0.104 second response time
[00:33:10] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 330 MB (3% inode=75%)
[02:40:05] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.29) (duration: 05m 56s)
[02:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:10] (PS1) ArielGlenn: no attribute caching on snaphot1005 [puppet] - https://gerrit.wikimedia.org/r/428267 (https://phabricator.wikimedia.org/T191177)
[05:17:26] !log Deploy schema change on db1070 (s5 primary master) - T191519 T188299 T190148
[05:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:34] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519
[05:17:34] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148
[05:17:34] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299
[05:20:43] (CR) ArielGlenn: [C: 2] no attribute caching on snaphot1005 [puppet] - https://gerrit.wikimedia.org/r/428267 (https://phabricator.wikimedia.org/T191177) (owner: ArielGlenn)
[05:28:54] !log flow_subscription empty table from officewiki - T149936
[05:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:00] T149936: Drop flow_subscription table - https://phabricator.wikimedia.org/T149936
[05:30:00] (PS2) Marostegui: wiki replicas: Depool labsdb1010 for MCR table additions [puppet] - https://gerrit.wikimedia.org/r/428037 (https://phabricator.wikimedia.org/T184446) (owner: Bstorm)
[05:36:43] Operations, monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4148683 (Marostegui) `atop` has been running without `-R` for the whole weekend and this has caused no errors: https://logstash.wikimedia.org/goto/6c9cfe4615f0538d8e633d299609e7e0 the last error shown th...
[06:27:14] !log Remove logging_pre_1_10 from codfw - T118859
[06:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:14] T118859: Drop backup tables for old schema changes (aka "x_pre_y" tables) - https://phabricator.wikimedia.org/T118859
[06:56:58] Operations, Scap, Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4148707 (Joe) >>! In T191921#4147196, @Krinkle wrote: > @thcipriani Hm.. these are seconds though, as opposed to minutes. Is there something differen...
[07:00:36] !log upgrading remaining app servers to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build
[07:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:33] Operations, monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4148760 (jcrespo) We should decide either drop the package from all production hosts (because it is obsolete thanks to the more granular prometheus) or configure it to run without -R. Can people say wha...
[07:23:46] (PS1) Elukey: profile::druid::pivot: move monitoring from module to profile [puppet] - https://gerrit.wikimedia.org/r/428280
[07:27:06] (CR) Elukey: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/10996/thorium.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/428280 (owner: Elukey)
[07:28:09] (CR) Jcrespo: [C: 2] mariadb: Update mysql 8.0 package [software] - https://gerrit.wikimedia.org/r/427926 (owner: Jcrespo)
[07:29:05] (PS1) ArielGlenn: don't rely on grep trick to see if there's binary crap in xml files [dumps] - https://gerrit.wikimedia.org/r/428281
[07:35:08] !log reboot ms-be2034 - stuck in com2 console with "sd 0:1:0:1: rejecting I/O to offline device", not responsive to ssh
[07:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:00] !log upgrading remaining API servers to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build
[07:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:30] PROBLEM - Host ms-be2034 is DOWN: PING CRITICAL - Packet loss = 100%
[07:37:51] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2034 is OK: OK ferm input default policy is set
[07:38:00] RECOVERY - Host ms-be2034 is UP: PING OK - Packet loss = 0%, RTA = 37.00 ms
[07:38:10] RECOVERY - Check systemd state on ms-be2034 is OK: OK - running: The system is fully operational
[07:38:20] RECOVERY - swift-container-updater on ms-be2034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[07:38:30] RECOVERY - MD RAID on ms-be2034 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[07:38:41] RECOVERY - Disk space on ms-be2034 is OK: DISK OK
[07:39:15] PROBLEM - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:39:32] <_joe_> uh what's up?
[07:40:14] gehel: ^ expected?
[07:40:15] <_joe_> pybal is not considering wdqs to be down
[07:40:23] <_joe_> godog: I think it's not
[07:40:45] looking
[07:40:55] _joe_: ah! interesting
[07:40:59] <_joe_> moritzm: 10 servers at the same time?
[07:41:05] RECOVERY - LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.029 second response time
[07:41:07] <_joe_> godog: I suspect we only check some static url
[07:41:11] Operations, monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4148806 (Marostegui) Me, personally, never used it. +1 to drop it from my side
[07:41:16] nope, we check a real query
[07:41:32] <_joe_> so that's pretty strange
[07:41:47] service endpoint seems to work...
[07:41:56] gehel: gooood morning! :)
[07:42:11] <_joe_> yeah that's quite strange
[07:42:13] * gehel was planning on a slow start this morning
[07:43:27] wdqs1003 seems to have had some trouble, looking
[07:43:56] Operations, monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4142726 (ArielGlenn) I use it from time to time, though much more often I inspect the current atop logs. Never use -R though.
[07:46:58] there is a new bot coming from a few AWS instances. It is getting throttled, but maybe not aggressively enough
[07:47:03] _joe_: yeah, up to 10 for app servers and API
[07:47:25] !log Drop table logging_pre_1_10 in s6 - T118859
[07:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:31] T118859: Drop backup tables for old schema changes (aka "x_pre_y" tables) - https://phabricator.wikimedia.org/T118859
[07:47:34] <_joe_> it's a bit too much given the clusters are much smaller than in the past, but if you do that regularly should be ok
[07:47:53] <_joe_> gehel: also, there are the OSM sync issues for both maps clusters
[07:48:05] * gehel loves Mondays!
[07:48:24] the OSM sync is minor, I'll have a look later
[07:50:08] !log Drop table logging_pre_1_10 in s2 - T118859
[07:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:43] (PS3) Marostegui: wiki replicas: Depool labsdb1010 for MCR table additions [puppet] - https://gerrit.wikimedia.org/r/428037 (https://phabricator.wikimedia.org/T184446) (owner: Bstorm)
[07:54:21] (CR) Marostegui: [C: 2] wiki replicas: Depool labsdb1010 for MCR table additions [puppet] - https://gerrit.wikimedia.org/r/428037 (https://phabricator.wikimedia.org/T184446) (owner: Bstorm)
[07:54:35] (PS2) Filippo Giunchedi: rubocop: display cop names [puppet] - https://gerrit.wikimedia.org/r/427619
[07:55:08] Operations, monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4142726 (Volans) Personally never used, +1 to drop it. Is there any data that atop provides that is not already available in grafana?
[07:55:20] !log Depool labsdb1010 - T184446
[07:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:26] T184446: Configure Toolforge replica views and dumps for the new MCR tables - https://phabricator.wikimedia.org/T184446
[07:55:32] !log reload haproxy on dbproxy1010 to depool labsdb1010
[07:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:43] (CR) Filippo Giunchedi: [C: 2] rubocop: display cop names [puppet] - https://gerrit.wikimedia.org/r/427619 (owner: Filippo Giunchedi)
[07:58:41] !log cp-misc: upgrade varnish to 5.1.3-1wm7
[07:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:13] Operations, monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4148839 (jcrespo)
[08:02:43] PROBLEM - WDQS HTTP Port on wdqs1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 3462 bytes in 0.002 second response time
[08:04:27] !log upgrading terbium to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build
[08:04:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:27] Operations, monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4148840 (jcrespo) Alternatively, only the cron.d/daemon is problematic- we could keep the package (*if* it is useful interactively for some and drop the crontab file/systemd unit).
[08:08:05] !log restarting blazegraph on wdqs1003 (crazy number of java threads)
[08:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:57] <_joe_> !log restarting memcached in codfw (T184854)
[08:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:03] T184854: hhvm memcached and php7 memcached extensions do not play well together - https://phabricator.wikimedia.org/T184854
[08:09:52] RECOVERY - WDQS HTTP Port on wdqs1003 is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.028 second response time
[08:13:02] !log Drop table logging_pre_1_10 in s7 - T118859
[08:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:08] T118859: Drop backup tables for old schema changes (aka "x_pre_y" tables) - https://phabricator.wikimedia.org/T118859
[08:14:15] Operations, ops-codfw: Degraded RAID on ms-be2034 - https://phabricator.wikimedia.org/T192721#4148845 (fgiunchedi) Host was rebooted by @elukey this morning, though upon reboot the raid is assembled correctly: ``` root@ms-be2034:~# cat /proc/mdstat Personalities : [raid1] [linear] [multipath] [raid0] [...
[08:17:53] <_joe_> !log upgrading nginx on the config cluster in codfw (T164456)
[08:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:00] T164456: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456
[08:18:36] wdqs seems to be back under control after restart of blazegraph on wdqs1003
[08:18:46] that still needs a bit more investigation
[08:18:57] (PS4) Muehlenhoff: Enable base::service_auto_restart for prometheus-snmp-exporter [puppet] - https://gerrit.wikimedia.org/r/424243 (https://phabricator.wikimedia.org/T135991)
[08:20:12] (CR) Muehlenhoff: [C: 2] Enable base::service_auto_restart for prometheus-snmp-exporter [puppet] - https://gerrit.wikimedia.org/r/424243 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:21:39] (PS2) Muehlenhoff: Enable base::service_auto_restart for prometheus-nutcracker-exporter [puppet] - https://gerrit.wikimedia.org/r/427884 (https://phabricator.wikimedia.org/T135991)
[08:24:31] (CR) Muehlenhoff: [C: 2] Enable base::service_auto_restart for prometheus-nutcracker-exporter [puppet] - https://gerrit.wikimedia.org/r/427884 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:25:13] PROBLEM - PyBal connections to etcd on lvs5001 is CRITICAL: CRITICAL: 0 connections established with conf2003.codfw.wmnet:2379 (min=4)
[08:25:14] (PS7) Volans: Cumin masters in WMCS: upgrade to python3 [puppet] - https://gerrit.wikimedia.org/r/419131 (https://phabricator.wikimedia.org/T188112)
[08:25:16] (PS8) Volans: Cumin masters in prod: upgrade to python3 [puppet] - https://gerrit.wikimedia.org/r/412894 (https://phabricator.wikimedia.org/T187773)
[08:25:42] PROBLEM - PyBal connections to etcd on lvs4005 is CRITICAL: CRITICAL: 0 connections established with conf2003.codfw.wmnet:2379 (min=4)
[08:26:08] _joe_: just FYI, I know you're restarting them ^^^
[08:26:12] PROBLEM - PyBal connections to etcd on lvs5003 is CRITICAL: CRITICAL: 5 connections established with conf2003.codfw.wmnet:2379 (min=12)
[08:26:18] <_joe_> volans: yeah we know
[08:26:22] (PS1) Jcrespo: mariadb: Depool db1110 for upgrade [mediawiki-config] - https://gerrit.wikimedia.org/r/428285
[08:26:43] <_joe_> I'm restarting pybal on 5003, vgutierrez
[08:26:57] ack
[08:27:11] (PS2) Muehlenhoff: Enable base::service_auto_restart for cassandra-metrics-collector [puppet] - https://gerrit.wikimedia.org/r/427889 (https://phabricator.wikimedia.org/T135991)
[08:27:12] PROBLEM - PyBal connections to etcd on lvs2004 is CRITICAL: CRITICAL: 0 connections established with conf2001.codfw.wmnet:2379 (min=4)
[08:27:18] (CR) Marostegui: [C: 1] "no maintenance from my side" [mediawiki-config] - https://gerrit.wikimedia.org/r/428285 (owner: Jcrespo)
[08:27:22] <_joe_> !log restarting pybal on lvs5003
[08:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:35] !logging restarting pybal on lvs4005
[08:27:35] To log a message, use the following format: !log
[08:27:38] ar
[08:27:41] !log restarting pybal on lvs4005
[08:27:43] (CR) Volans: "> That requires the cumin master to use Stretch. On Jessie there are" [puppet] - https://gerrit.wikimedia.org/r/419131 (https://phabricator.wikimedia.org/T188112) (owner: Volans)
[08:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:56] <_joe_> vgutierrez: yeah the restart sealed the deal
[08:28:06] yup...
[08:28:25] (PS2) ArielGlenn: don't rely on grep trick to see if there's binary crap in xml files [dumps] - https://gerrit.wikimedia.org/r/428281
[08:28:33] PROBLEM - PyBal connections to etcd on lvs2001 is CRITICAL: CRITICAL: 0 connections established with conf2001.codfw.wmnet:2379 (min=4)
[08:28:33] PROBLEM - PyBal connections to etcd on lvs2005 is CRITICAL: CRITICAL: 4 connections established with conf2001.codfw.wmnet:2379 (min=14)
[08:29:12] PROBLEM - PyBal connections to etcd on lvs2002 is CRITICAL: CRITICAL: 4 connections established with conf2001.codfw.wmnet:2379 (min=14)
[08:29:13] PROBLEM - PyBal connections to etcd on lvs5002 is CRITICAL: CRITICAL: 2 connections established with conf2003.codfw.wmnet:2379 (min=8)
[08:29:32] PROBLEM - PyBal connections to etcd on lvs4006 is CRITICAL: CRITICAL: 0 connections established with conf2003.codfw.wmnet:2379 (min=8)
[08:30:07] !log Drop table logging_pre_1_10 in s4 - T118859
[08:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:13] T118859: Drop backup tables for old schema changes (aka "x_pre_y" tables) - https://phabricator.wikimedia.org/T118859
[08:30:23] PROBLEM - PyBal connections to etcd on lvs2006 is CRITICAL: CRITICAL: 2 connections established with conf2001.codfw.wmnet:2379 (min=29)
[08:30:23] !log restarting pybal on lvs5001
[08:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:42] RECOVERY - PyBal connections to etcd on lvs4005 is OK: OK: 4 connections established with conf2003.codfw.wmnet:2379 (min=4)
[08:30:52] PROBLEM - PyBal connections to etcd on lvs2003 is CRITICAL: CRITICAL: 2 connections established with conf2001.codfw.wmnet:2379 (min=29)
[08:31:12] RECOVERY - PyBal connections to etcd on lvs5003 is OK: OK: 12 connections established with conf2003.codfw.wmnet:2379 (min=12)
[08:31:23] !log restarting pybal on lvs5002
[08:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:12] PROBLEM - PyBal connections to etcd on lvs4007 is CRITICAL: CRITICAL: 0 connections established with conf2003.codfw.wmnet:2379 (min=12)
[08:32:20] (CR) Muehlenhoff: [C: 2] Enable base::service_auto_restart for cassandra-metrics-collector [puppet] - https://gerrit.wikimedia.org/r/427889 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[08:33:26] !log restart pybal on lvs4007
[08:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:12] RECOVERY - PyBal connections to etcd on lvs5002 is OK: OK: 8 connections established with conf2003.codfw.wmnet:2379 (min=8)
[08:35:13] RECOVERY - PyBal connections to etcd on lvs5001 is OK: OK: 4 connections established with conf2003.codfw.wmnet:2379 (min=4)
[08:36:06] (PS1) Marostegui: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/428286
[08:36:10] !log restarting pybal on codfw (once at a time)
[08:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:37:12] RECOVERY - PyBal connections to etcd on lvs4007 is OK: OK: 12 connections established with conf2003.codfw.wmnet:2379 (min=12)
[08:37:12] RECOVERY - PyBal connections to etcd on lvs2004 is OK: OK: 4 connections established with conf2001.codfw.wmnet:2379 (min=4)
[08:37:46] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/428286 (owner: Marostegui)
[08:38:33] RECOVERY - PyBal connections to etcd on lvs2001 is OK: OK: 4 connections established with conf2001.codfw.wmnet:2379 (min=4)
[08:38:55] (Merged) jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/428286 (owner: Marostegui)
[08:39:15] (CR) jenkins-bot: db-eqiad.php: Depool db1078 [mediawiki-config] - https://gerrit.wikimedia.org/r/428286 (owner: Marostegui)
[08:39:45] (CR) Mark Bergsma: [C: 2] Handle non-IDLE states in idleHoldTimeEvent [debs/pybal] - https://gerrit.wikimedia.org/r/423997 (owner: Mark Bergsma)
[08:40:34] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1078 (duration: 01m 18s)
[08:40:36] (Merged) jenkins-bot: Handle non-IDLE states in idleHoldTimeEvent [debs/pybal] - https://gerrit.wikimedia.org/r/423997 (owner: Mark Bergsma)
[08:40:38] (Merged) jenkins-bot: Fix sendNotification invocation [debs/pybal] - https://gerrit.wikimedia.org/r/423998 (owner: Mark Bergsma)
[08:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:40] (Merged) jenkins-bot: Fix two typos in bgp.FSM.openReceived [debs/pybal] - https://gerrit.wikimedia.org/r/423999 (owner: Mark Bergsma)
[08:40:42] RECOVERY - PyBal connections to etcd on lvs2002 is OK: OK: 14 connections established with conf2001.codfw.wmnet:2379 (min=14)
[08:40:42] RECOVERY - PyBal connections to etcd on lvs2005 is OK: OK: 14 connections established with conf2001.codfw.wmnet:2379 (min=14)
[08:41:10] (PS3) Jcrespo: mariadb: Create /run/mysqld on server start with tmpfiles.d [puppet] - https://gerrit.wikimedia.org/r/427902
[08:41:12] (PS3) Jcrespo: mariadb: Do not create /srv/sqldata and /srv/tmp if datadir is false [puppet] - https://gerrit.wikimedia.org/r/427904
[08:42:42] RECOVERY - PyBal connections to etcd on lvs2003 is OK: OK: 29 connections established with conf2001.codfw.wmnet:2379 (min=29)
[08:42:52] !log restarting pybal on lvs4006
[08:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:32] RECOVERY - PyBal connections to etcd on lvs4006 is OK: OK: 8 connections established with conf2003.codfw.wmnet:2379 (min=8)
[08:45:13] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/428288
[08:45:22] RECOVERY - PyBal connections to etcd on lvs2006 is OK: OK: 29 connections established with conf2001.codfw.wmnet:2379 (min=29)
[08:47:01] !log Dropped table logging_pre_1_10 in s3 - T118859
[08:47:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:07] T118859: Drop backup tables for old schema changes (aka "x_pre_y" tables) - https://phabricator.wikimedia.org/T118859
[08:47:16] !log Drop table logging_pre_1_10 in s5 - T118859
[08:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:34] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/428288 (owner: Marostegui)
[08:48:44] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/428288 (owner: Marostegui)
[08:48:51] <_joe_> !log upgrading nginx on the config cluster in eqiad (T164456)
[08:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:56] T164456: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456
[08:49:30] (CR) Mark Bergsma: [C: 2] Fix holdTimeEvent incrementing connectRetryCounter twice [debs/pybal] - https://gerrit.wikimedia.org/r/424000 (owner: Mark Bergsma)
[08:49:58] (Merged) jenkins-bot: Fix holdTimeEvent incrementing connectRetryCounter twice [debs/pybal] - https://gerrit.wikimedia.org/r/424000 (owner: Mark Bergsma)
[08:50:28] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1078 (duration: 01m 16s)
[08:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:14] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1078" [mediawiki-config] - https://gerrit.wikimedia.org/r/428288 (owner: Marostegui)
[08:51:40] Operations, ops-codfw: Degraded RAID on ms-be2034 - https://phabricator.wikimedia.org/T192721#4148945 (fgiunchedi) From syslog servers I was able to get kernel messages, in P7024 (omitting the "rejecting i/o to offline device" spammy message). Looks like `hpsa` detected a controller lockup and things sn...
[08:52:22] !log restarting pybal on esams cluster
[08:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:17] !log reimaging mw1270,mw1271,mw1272 (app servers) to stretch
[08:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:11] !log rolling restart of blazegraph on wdqs1004, 2004 and 2005 for JVM upgrade
[08:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:40] !log Deploy schema change on s6 codfw master (db2039) - this will generate lag on codfw - T191519 T188299 T190148
[08:57:40] Operations, ops-codfw: Degraded RAID on ms-be2034 - https://phabricator.wikimedia.org/T192721#4148952 (fgiunchedi) Looks similar enough to T184390, I'll upgrade the raid controller firmware since that's pending anyways.
[08:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:57:47] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519
[08:57:47] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148
[08:57:48] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299
[08:59:45] !log reimaging mw1283,mw1285,mw1286 (API servers) to stretch
[08:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:22] PROBLEM - HHVM rendering on mw2215 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:02:23] <_joe_> !log restarting etcdmirror on conf2002 after restarting nginx on conf1001
[09:02:25] Operations, ops-codfw, media-storage: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756#4148967 (fgiunchedi)
[09:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:45] (PS4) Jcrespo: mariadb: Create /run/mysqld on server start with tmpfiles.d [puppet] - https://gerrit.wikimedia.org/r/427902
[09:03:10] (PS4) Jcrespo: mariadb: Do not create /srv/sqldata and /srv/tmp if datadir is false [puppet] - https://gerrit.wikimedia.org/r/427904
[09:03:12] RECOVERY - HHVM rendering on mw2215 is OK: HTTP OK: HTTP/1.1 200 OK - 78042 bytes in 0.304 second response time
[09:04:18] (CR) Mark Bergsma: [C: 2] Fix distinction between events 19 and 20 (delayOpen) [debs/pybal] - https://gerrit.wikimedia.org/r/424001 (owner: Mark Bergsma)
[09:04:52] (Merged) jenkins-bot: Fix distinction between events 19 and 20 (delayOpen) [debs/pybal] - https://gerrit.wikimedia.org/r/424001 (owner: Mark Bergsma)
[09:05:01] !log restarting pybal on lvs1006
[09:05:03] PROBLEM - PyBal connections to etcd on lvs1006 is CRITICAL: CRITICAL: 32 connections established with conf1001.eqiad.wmnet:2379 (min=40)
[09:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:12] <_joe_> !log restart memcached on mw1019 (Ttail -f /var/log/etcdmirror-conftool-eqiad-wmnet/syslog.log
[09:05:15] <_joe_> arg
[09:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:27] <_joe_> that was all wrong
[09:05:48] <_joe_> !log AMEND: restart memcached on mc1019 (T184854)
[09:05:49] _joe_: at least it was an innocent clipboard leak :P
[09:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:54] T184854: hhvm memcached and php7 memcached extensions do not play well together - https://phabricator.wikimedia.org/T184854
[09:06:13] <_joe_> vgutierrez: I can't get used to firefox not using the standard paste ring in Xorg
[09:09:12] RECOVERY - PyBal connections to etcd on lvs1006 is OK: OK: 40 connections established with conf1001.eqiad.wmnet:2379 (min=40)
[09:09:29] Operations, Discovery, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: WDQS endpoint timeout - https://phabricator.wikimedia.org/T192759#4149000 (Gehel)
[09:11:39] Operations, Traffic: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456#3234359 (Vgutierrez) Current picture: ```vgutierrez@neodymium:~$ sudo cumin 'R:class = "tlsproxy::instance"' 'apt-cache policy nginx-full|egrep "Installed:|Candidate:"' 366 hosts will be targeted: conf[2001-2003].codf...
[09:11:57] (Draft1) MarcoAurelio: Grant Meta-Wiki sysops the ability to edit global abusefilter rules [mediawiki-config] - https://gerrit.wikimedia.org/r/428290 (https://phabricator.wikimedia.org/T192722)
[09:12:01] (PS2) MarcoAurelio: Grant Meta-Wiki sysops the ability to edit global abusefilter rules [mediawiki-config] - https://gerrit.wikimedia.org/r/428290 (https://phabricator.wikimedia.org/T192722)
[09:13:35] !log Flashing Smart Array P840 in Slot 3 [ 4.52 -> 6.30 ] on ms-be2034 - T192721 T141756
[09:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:42] T141756: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756
[09:13:43] T192721: Degraded RAID on ms-be2034 - https://phabricator.wikimedia.org/T192721
[09:23:16] Operations, Citoid, VisualEditor, Patch-For-Review, and 2 others: Some requests for DOIs are failing or very slow; if we have a DOI and the request is taking too long, just use CrossRef data instead. - https://phabricator.wikimedia.org/T165105#4149045 (Mvolz) >>! In T165105#4080339, @gerritbot wr...
[09:25:11] Operations, ops-codfw: Degraded RAID on ms-be2034 - https://phabricator.wikimedia.org/T192721#4149052 (fgiunchedi) Open>Resolved a:fgiunchedi Firmware upgraded, I'll tentatively resolve and reopen if we see reoccurence
[09:27:18] (CR) Jcrespo: "https://puppet-compiler.wmflabs.org/compiler02/10999/" [puppet] - https://gerrit.wikimedia.org/r/427904 (owner: Jcrespo)
[09:27:22] (CR) Jcrespo: [C: 2] mariadb: Do not create /srv/sqldata and /srv/tmp if datadir is false [puppet] - https://gerrit.wikimedia.org/r/427904 (owner: Jcrespo)
[09:29:27] (CR) Mark Bergsma: [C: 2] Handle state ESTABLISHED in versionError (event 24) [debs/pybal] - https://gerrit.wikimedia.org/r/424002 (owner: Mark Bergsma)
[09:29:38] (PS5) Jcrespo: mariadb: Create /run/mysqld on server start with tmpfiles.d [puppet] - https://gerrit.wikimedia.org/r/427902
[09:29:56] (Merged) jenkins-bot: Handle state ESTABLISHED in versionError (event 24) [debs/pybal] - https://gerrit.wikimedia.org/r/424002 (owner: Mark Bergsma)
[09:37:10] (CR) Mark Bergsma: [C: 2] Handle state OPENSENT in keepAliveEvent (event 11) [debs/pybal] - https://gerrit.wikimedia.org/r/424003 (owner: Mark Bergsma)
[09:37:38] (Merged) jenkins-bot: Handle state OPENSENT in keepAliveEvent (event 11) [debs/pybal] - https://gerrit.wikimedia.org/r/424003 (owner: Mark Bergsma)
[09:40:30] (CR) Ema: [C: 1] alerts: add varnish/nginx HTTP availability [puppet] - https://gerrit.wikimedia.org/r/408785 (https://phabricator.wikimedia.org/T186069) (owner: Filippo Giunchedi)
[09:40:58] (PS1) Muehlenhoff: Inline role::mediawiki::scaler [puppet] - https://gerrit.wikimedia.org/r/428295
[09:41:00] (PS1) Muehlenhoff: Remove unused role::mediawiki::scaler [puppet] - https://gerrit.wikimedia.org/r/428296
[09:41:24] (CR) jerkins-bot: [V: -1] Inline role::mediawiki::scaler [puppet] - https://gerrit.wikimedia.org/r/428295 (owner: Muehlenhoff)
[09:44:14] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1966 bytes in 0.106 second response time
[09:45:53] (PS4) Filippo Giunchedi: alerts: add varnish/nginx HTTP availability [puppet] - https://gerrit.wikimedia.org/r/408785 (https://phabricator.wikimedia.org/T186069)
[09:49:49] (PS5) Filippo Giunchedi: alerts: add varnish/nginx HTTP availability [puppet] - https://gerrit.wikimedia.org/r/408785 (https://phabricator.wikimedia.org/T186069)
[09:52:12] (CR) Filippo Giunchedi: [C: 2] alerts: add varnish/nginx HTTP availability [puppet] - https://gerrit.wikimedia.org/r/408785 (https://phabricator.wikimedia.org/T186069) (owner: Filippo Giunchedi)
[09:55:41] (PS1) Ladsgroup: mediawiki: Delete pre-2016 autopatrol actions from logging table of wikidata [puppet] - https://gerrit.wikimedia.org/r/428297 (https://phabricator.wikimedia.org/T189596)
[09:56:30] <_joe_> !log restarting memcached on mc1020-1036 at 1 hour intervals - T184854
[09:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:36] T184854: hhvm memcached and php7 memcached extensions do not play well together - https://phabricator.wikimedia.org/T184854
[09:58:29] Operations, ops-codfw, media-storage: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756#4149128 (fgiunchedi) Latest audit via `cumin` ``` root@neodymium:~# cumin 'F:manufacturer = HP' 'if [ -x /usr/sbin/hpssacli ] ; then cat /sys/class/scsi_disk/*\:1\:0\...
[10:04:53] (CR) Mark Bergsma: [C: 2] Handle state OPENSENT in keepAliveReceived [debs/pybal] - https://gerrit.wikimedia.org/r/424004 (owner: Mark Bergsma)
[10:05:01] (PS1) Muehlenhoff: Don't include mediawiki::multimedia on labweb* [puppet] - https://gerrit.wikimedia.org/r/428298
[10:05:22] (Merged) jenkins-bot: Handle state OPENSENT in keepAliveReceived [debs/pybal] - https://gerrit.wikimedia.org/r/424004 (owner: Mark Bergsma)
[10:10:35] (CR) Mark Bergsma: [C: 2] Correctly handle event 9 (connectRetryTimeEvent) in ACTIVE [debs/pybal] - https://gerrit.wikimedia.org/r/424005 (owner: Mark Bergsma)
[10:11:02] (Merged) jenkins-bot: Correctly handle event 9 (connectRetryTimeEvent) in ACTIVE [debs/pybal] - https://gerrit.wikimedia.org/r/424005 (owner: Mark Bergsma)
[10:11:09] (CR) Mark Bergsma: [C: 2] Fix typo in FSM.delayOpenTimeEvent [debs/pybal] - https://gerrit.wikimedia.org/r/424006 (owner: Mark Bergsma)
[10:11:40] (Merged) jenkins-bot: Fix typo in FSM.delayOpenTimeEvent [debs/pybal] - https://gerrit.wikimedia.org/r/424006 (owner: Mark Bergsma)
[10:13:19] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/428299 (https://phabricator.wikimedia.org/T128546)
[10:13:20] PROBLEM - HTTP availability for Nginx on einsteinium is CRITICAL: cluster={cache_misc,thumbor} site={eqiad,esams} https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[10:14:26] godog: ^
[10:15:06] PROBLEM - HTTP availability for Varnish on einsteinium is CRITICAL: job=varnish-misc site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[10:15:17] (CR) Jdrewniak: [C: 2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/428299 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:15:27] ema: thanks, so yeah we'll need to tweak it as expected heh
[10:15:34] fair enough! [10:16:29] godog: can we get it to print the computed availability leading to the alert? [10:16:32] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428299 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:16:35] (03PS1) 10Muehlenhoff: Remove obsolete fontconfig/imagemagick code from mediawiki::multimedia [puppet] - 10https://gerrit.wikimedia.org/r/428300 [10:17:00] (03CR) 10jerkins-bot: [V: 04-1] Remove obsolete fontconfig/imagemagick code from mediawiki::multimedia [puppet] - 10https://gerrit.wikimedia.org/r/428300 (owner: 10Muehlenhoff) [10:18:06] PROBLEM - HTTP availability for Varnish on einsteinium is CRITICAL: job=varnish-misc site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [10:18:10] 10Operations, 10Citoid, 10VisualEditor, 10Patch-For-Review, and 2 others: Some requests for DOIs are failing or very slow; if we have a DOI and the request is taking too long, just use CrossRef data instead. - https://phabricator.wikimedia.org/T165105#4149139 (10mobrovac) Perhaps we should provide differen... 
[10:18:53] (03PS2) 10Muehlenhoff: Remove obsolete fontconfig/imagemagick code from mediawiki::multimedia [puppet] - 10https://gerrit.wikimedia.org/r/428300 [10:19:25] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428299 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:19:35] ema: for scalar results that's already the case yeah, the value will be printed, for vector results no, we have to try and play with some representations and see what the result is, since there could be many results from a single query [10:19:48] (03PS1) 10Ema: 1.7: fix implicit function declaration warning [software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/428301 [10:22:29] (03PS2) 10Ema: 1.7: fix implicit function declaration warning [software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/428301 [10:22:58] (03PS2) 10Giuseppe Lavagetto: scap-helm: add logging to SAL for install, upgrade [puppet] - 10https://gerrit.wikimedia.org/r/427337 [10:23:03] (03CR) 10Giuseppe Lavagetto: scap-helm: add logging to SAL for install, upgrade (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/427337 (owner: 10Giuseppe Lavagetto) [10:24:47] (03PS8) 10Volans: Add CLI script to be installed in the target hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394990 (https://phabricator.wikimedia.org/T167504) [10:24:49] (03PS10) 10Volans: Add basic test coverage [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504) [10:24:51] (03PS6) 10Volans: Add login and LDAP support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504) [10:24:53] (03PS1) 10Volans: Add server side validation of client certificates [software/debmonitor] - 10https://gerrit.wikimedia.org/r/428302 (https://phabricator.wikimedia.org/T167504) [10:24:56] * volans waiting the -1 for all of them [10:25:07] (03CR) 
10jerkins-bot: [V: 04-1] Add basic test coverage [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [10:25:10] (03CR) 10jerkins-bot: [V: 04-1] Add CLI script to be installed in the target hosts [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394990 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [10:25:12] (03CR) 10jerkins-bot: [V: 04-1] Add server side validation of client certificates [software/debmonitor] - 10https://gerrit.wikimedia.org/r/428302 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [10:25:14] (03CR) 10jerkins-bot: [V: 04-1] Add login and LDAP support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [10:25:20] !log jdrewniak@tin Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:428299|Bumping portals to master (T128546)]] (duration: 01m 17s) [10:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:26] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:26:37] !log jdrewniak@tin Synchronized portals: Wikimedia Portals Update: [[gerrit:428299|Bumping portals to master (T128546)]] (duration: 01m 16s) [10:26:39] (03CR) 10Volans: "For reference adding the result of tox run locally given that CI is not yet able to run the tests (see T191764)" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [10:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:11] (03CR) 10Alexandros Kosiaris: [C: 04-1] scap-helm: add logging to SAL for install, upgrade (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/427337 (owner: 10Giuseppe Lavagetto) [10:27:17] (03CR) 10Alexandros Kosiaris: [C: 031] scap-helm: add logging to SAL for 
install, upgrade [puppet] - 10https://gerrit.wikimedia.org/r/427337 (owner: 10Giuseppe Lavagetto) [10:28:58] (03PS1) 10Vgutierrez: Reset waitIndex on etcd error 401 [debs/pybal] - 10https://gerrit.wikimedia.org/r/428303 (https://phabricator.wikimedia.org/T169765) [10:30:17] (03PS3) 10Hashar: Enable base::service_auto_restart for atftpd [puppet] - 10https://gerrit.wikimedia.org/r/424584 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:30:53] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for atftpd [puppet] - 10https://gerrit.wikimedia.org/r/424584 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:32:19] (03PS1) 10Filippo Giunchedi: Restrict HTTP availability alerts to Varnish/Nginx text/upload [puppet] - 10https://gerrit.wikimedia.org/r/428305 (https://phabricator.wikimedia.org/T186069) [10:32:47] ema: ^ [10:34:52] godog: job!= ? [10:36:03] ema: ahem, fixed [10:36:08] (03PS2) 10Filippo Giunchedi: Restrict HTTP availability alerts to Varnish/Nginx text/upload [puppet] - 10https://gerrit.wikimedia.org/r/428305 (https://phabricator.wikimedia.org/T186069) [10:36:54] (03CR) 10Vgutierrez: [C: 032] Move updating of FSM metric labels to the protocol's connectionMade [debs/pybal] - 10https://gerrit.wikimedia.org/r/424007 (owner: 10Mark Bergsma) [10:37:17] sigh... that should be a +1 [10:37:22] sorry mark [10:37:24] (03Merged) 10jenkins-bot: Move updating of FSM metric labels to the protocol's connectionMade [debs/pybal] - 10https://gerrit.wikimedia.org/r/424007 (owner: 10Mark Bergsma) [10:37:30] 10Operations, 10Performance-Team, 10Availability (MediaWiki-MultiDC), 10User-Joe: Create a prometheus exporter for mcrouter - https://phabricator.wikimedia.org/T192763#4149158 (10Joe) [10:38:07] mark: are you ok with that being merged already or should I revert it? 
[10:39:19] 10Operations, 10Performance-Team, 10Availability (MediaWiki-MultiDC), 10User-Joe, 10User-fgiunchedi: Create a prometheus exporter for mcrouter - https://phabricator.wikimedia.org/T192763#4149183 (10fgiunchedi) [10:40:20] (03CR) 10Gilles: Remove obsolete fontconfig/imagemagick code from mediawiki::multimedia (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/428300 (owner: 10Muehlenhoff) [10:43:26] (03CR) 10Ema: [C: 031] Restrict HTTP availability alerts to Varnish/Nginx text/upload [puppet] - 10https://gerrit.wikimedia.org/r/428305 (https://phabricator.wikimedia.org/T186069) (owner: 10Filippo Giunchedi) [10:43:38] godog: the ship seems ready to depart! [10:44:03] (03PS3) 10Filippo Giunchedi: Restrict HTTP availability alerts to Varnish/Nginx text/upload [puppet] - 10https://gerrit.wikimedia.org/r/428305 (https://phabricator.wikimedia.org/T186069) [10:44:05] neat, let's see [10:44:33] one solution to printing the thresholds for scalars would be to include the "worst" value I guess [10:44:56] (03CR) 10Filippo Giunchedi: [C: 032] Restrict HTTP availability alerts to Varnish/Nginx text/upload [puppet] - 10https://gerrit.wikimedia.org/r/428305 (https://phabricator.wikimedia.org/T186069) (owner: 10Filippo Giunchedi) [10:46:09] (03CR) 10Muehlenhoff: Remove obsolete fontconfig/imagemagick code from mediawiki::multimedia (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/428300 (owner: 10Muehlenhoff) [10:48:22] (03CR) 10Gilles: [C: 031] Remove obsolete fontconfig/imagemagick code from mediawiki::multimedia [puppet] - 10https://gerrit.wikimedia.org/r/428300 (owner: 10Muehlenhoff) [10:51:25] PROBLEM - High lag on wdqs1004 is CRITICAL: 3640 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [10:55:49] vgutierrez: no that's fine [10:56:04] thanks! 
[10:56:29] (03PS1) 10Muehlenhoff: Stop including mediawiki::packages::multimedia for contint, instead add netpbm [puppet] - 10https://gerrit.wikimedia.org/r/428314 [10:56:43] ack [10:57:25] PROBLEM - HHVM rendering on mw2286 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:57:58] !log mobrovac@tin Started deploy [restbase/deploy@3f3f989]: Add lfnwiki, inhwiki, gorwiki and euwikisource - T192678 [10:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:04] T192678: Add lfnwiki, inhwiki, gorwiki and euwikisource to vars.yaml - RESTBase - https://phabricator.wikimedia.org/T192678 [10:58:30] RECOVERY - HTTP availability for Varnish on einsteinium is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [11:00:04] jan_drewniak: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180423T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. 
[11:00:05] !log restarting wdqs updater on all wdqs nodes [11:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:09] (03CR) 10Mark Bergsma: Reset waitIndex on etcd error 401 (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/428303 (https://phabricator.wikimedia.org/T169765) (owner: 10Vgutierrez) [11:03:00] (03CR) 10Ema: Reset waitIndex on etcd error 401 (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/428303 (https://phabricator.wikimedia.org/T169765) (owner: 10Vgutierrez) [11:05:13] (03CR) 10Mark Bergsma: [C: 032] Ignore headerError and openMessageError in state IDLE [debs/pybal] - 10https://gerrit.wikimedia.org/r/424008 (owner: 10Mark Bergsma) [11:05:43] (03Merged) 10jenkins-bot: Ignore headerError and openMessageError in state IDLE [debs/pybal] - 10https://gerrit.wikimedia.org/r/424008 (owner: 10Mark Bergsma) [11:09:44] !log mobrovac@tin Finished deploy [restbase/deploy@3f3f989]: Add lfnwiki, inhwiki, gorwiki and euwikisource - T192678 (duration: 11m 47s) [11:09:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:50] T192678: Add lfnwiki, inhwiki, gorwiki and euwikisource to vars.yaml - RESTBase - https://phabricator.wikimedia.org/T192678 [11:10:09] !log mobrovac@tin Started deploy [restbase/deploy@3f3f989]: Add lfnwiki, inhwiki, gorwiki and euwikisource, take #2 - T192678 [11:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:35] (03PS2) 10Vgutierrez: Reset waitIndex on etcd error 401 [debs/pybal] - 10https://gerrit.wikimedia.org/r/428303 (https://phabricator.wikimedia.org/T169765) [11:10:49] (03CR) 10Mark Bergsma: Cleanup module for consistency (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/424009 (owner: 10Mark Bergsma) [11:11:08] (03CR) 10Vgutierrez: Reset waitIndex on etcd error 401 (032 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/428303 (https://phabricator.wikimedia.org/T169765) (owner: 
10Vgutierrez) [11:12:11] RECOVERY - HHVM rendering on mw2286 is OK: HTTP OK: HTTP/1.1 200 OK - 78687 bytes in 0.366 second response time [11:12:55] (03CR) 10Mark Bergsma: [C: 031] Reset waitIndex on etcd error 401 [debs/pybal] - 10https://gerrit.wikimedia.org/r/428303 (https://phabricator.wikimedia.org/T169765) (owner: 10Vgutierrez) [11:13:50] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:13:51] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:14:00] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [11:14:31] RECOVERY - High lag on wdqs1004 is OK: (C)3600 ge (W)1200 ge 992 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:14:51] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [11:16:50] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [11:17:29] !log mobrovac@tin Finished deploy 
[restbase/deploy@3f3f989]: Add lfnwiki, inhwiki, gorwiki and euwikisource, take #2 - T192678 (duration: 07m 21s) [11:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:35] T192678: Add lfnwiki, inhwiki, gorwiki and euwikisource to vars.yaml - RESTBase - https://phabricator.wikimedia.org/T192678 [11:17:41] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [11:17:50] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [11:18:00] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [11:22:19] !log mobrovac@tin Started deploy [citoid/deploy@b3c0818]: Add support for restful crossRef API and Wikidata QIDs - T108175 T176411 [11:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:26] T108175: Add support for restful crossRef api - https://phabricator.wikimedia.org/T108175 [11:22:27] T176411: Fetch citation details from Wikidata, using QIDs - https://phabricator.wikimedia.org/T176411 [11:25:55] !log mobrovac@tin Finished deploy [citoid/deploy@b3c0818]: Add support for restful crossRef API and Wikidata QIDs - T108175 T176411 (duration: 03m 36s) [11:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:09] !log installing poppler security updates [11:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:51] !log reimaging mw1285 (previous attempt had a hardware problem which failed to trigger the reboot via IPMI) ,mw1287,mw1288 (API servers) to stretch [11:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:20] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1971 bytes in 0.111 second response time [11:50:47] !log reimaging mw1238,mw1239,mw1240 (app servers) to stretch [11:50:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:15] (03CR) 10BBlack: [C: 031] 1.7: fix implicit function declaration warning [software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/428301 (owner: 10Ema) [12:01:58] (03PS1) 10Ema: VCL: respond 400 to requests with bad Host values [puppet] - 10https://gerrit.wikimedia.org/r/428321 [12:04:08] (03CR) 10Ema: [C: 032] 1.7: fix implicit function declaration warning [software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/428301 (owner: 10Ema) [12:12:33] (03PS1) 10Ema: 1.7: bump version number [software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/428324 [12:13:21] (03CR) 10Ema: [C: 032] 1.7: bump version number [software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/428324 (owner: 10Ema) [12:16:14] (03CR) 10Ema: [C: 04-1] "There are known cases of internal requests w/o Host header (check_http). Those should be investigated and fixed before merging." [puppet] - 10https://gerrit.wikimedia.org/r/428321 (owner: 10Ema) [12:19:20] 10Operations, 10Performance-Team, 10Availability (MediaWiki-MultiDC): mcrouter production architecture - https://phabricator.wikimedia.org/T192771#4149483 (10Joe) [12:24:29] (03PS1) 10Ema: 1.7-1: fix compile-time warning + wrong distribution [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/428327 [12:26:24] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1971 bytes in 0.094 second response time [12:29:46] PROBLEM - High CPU load on API appserver on mw1287 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:29:46] PROBLEM - High CPU load on API appserver on mw1288 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[12:31:46] PROBLEM - Nginx local proxy to apache on mw1285 is CRITICAL: connect to address 10.64.16.50 and port 443: Connection refused [12:31:46] PROBLEM - mediawiki-installation DSH group on mw1238 is CRITICAL: Host mw1238 is not in mediawiki-installation dsh group [12:31:46] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1285 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:31:47] PROBLEM - nutcracker process on mw1287 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:31:47] PROBLEM - nutcracker process on mw1288 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:33:27] PROBLEM - Check whether ferm is active by checking the default input chain on mw1285 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:33:27] PROBLEM - puppet last run on mw1287 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:33:27] PROBLEM - puppet last run on mw1288 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:33:36] PROBLEM - HHVM rendering on mw1239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:35:07] moritzm: I guess 87 and 88 isn't you? [12:35:07] PROBLEM - Apache HTTP on mw1287 is CRITICAL: connect to address 10.64.16.52 and port 80: Connection refused [12:35:07] PROBLEM - Apache HTTP on mw1288 is CRITICAL: connect to address 10.64.16.53 and port 80: Connection refused [12:35:07] PROBLEM - DPKG on mw1285 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:35:07] PROBLEM - configured eth on mw1285 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[12:35:17] PROBLEM - Apache HTTP on mw1240 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:35:18] !log sbisson@tin Started deploy [kartotherian/deploy@2195dde]: Deploy kartotherian with new babel fallback rules [12:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:56] PROBLEM - Disk space on mw1285 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:36:56] PROBLEM - dhclient process on mw1285 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:36:56] PROBLEM - Check size of conntrack table on mw1287 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:36:56] PROBLEM - MD RAID on mw1287 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:36:56] PROBLEM - Check size of conntrack table on mw1288 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:36:57] PROBLEM - MD RAID on mw1288 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:37:06] PROBLEM - HHVM rendering on mw1238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:37:06] PROBLEM - Nginx local proxy to apache on mw1240 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:38:07] RECOVERY - Apache HTTP on mw1287 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.001 second response time [12:38:36] PROBLEM - HHVM processes on mw1285 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:38:37] PROBLEM - mediawiki-installation DSH group on mw1285 is CRITICAL: Host mw1285 is not in mediawiki-installation dsh group [12:38:37] PROBLEM - Check systemd state on mw1287 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:38:37] PROBLEM - Check systemd state on mw1288 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:38:37] PROBLEM - Check systemd state on mw1240 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[12:38:46] PROBLEM - Apache HTTP on mw1239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:38:56] PROBLEM - HHVM processes on mw1239 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:38:56] PROBLEM - nutcracker process on mw1239 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:39:07] RECOVERY - Apache HTTP on mw1288 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.001 second response time [12:39:16] PROBLEM - nutcracker process on mw1238 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nutcracker), command name nutcracker [12:39:36] PROBLEM - nutcracker port on mw1238 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [12:39:56] RECOVERY - HHVM processes on mw1239 is OK: PROCS OK: 6 processes with command name hhvm [12:40:06] PROBLEM - High CPU load on API appserver on mw1285 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:40:10] !log sbisson@tin Finished deploy [kartotherian/deploy@2195dde]: Deploy kartotherian with new babel fallback rules (duration: 04m 52s) [12:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:17] PROBLEM - Nginx local proxy to apache on mw1287 is CRITICAL: connect to address 10.64.16.52 and port 443: Connection refused [12:40:17] PROBLEM - Nginx local proxy to apache on mw1288 is CRITICAL: connect to address 10.64.16.53 and port 443: Connection refused [12:40:17] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1287 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:40:17] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1288 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:40:17] PROBLEM - nutcracker port on mw1285 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[12:40:17] PROBLEM - Nginx local proxy to apache on mw1239 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.019 second response time [12:42:01] (03CR) 10Ema: [C: 032] 1.7-1: fix compile-time warning + wrong distribution [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/428327 (owner: 10Ema) [12:42:05] yeah, reimage spam [12:42:06] PROBLEM - Apache HTTP on mw1238 is CRITICAL: connect to address 10.64.48.73 and port 80: Connection refused [12:42:06] PROBLEM - nutcracker process on mw1285 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:42:06] PROBLEM - Check whether ferm is active by checking the default input chain on mw1287 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:42:06] PROBLEM - Check whether ferm is active by checking the default input chain on mw1288 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:42:07] silencing [12:43:46] PROBLEM - puppet last run on mw1285 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:43:47] PROBLEM - DPKG on mw1287 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:43:47] PROBLEM - configured eth on mw1288 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:43:47] PROBLEM - DPKG on mw1288 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:43:47] PROBLEM - configured eth on mw1287 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[12:44:22] (03Abandoned) 10Ema: VCL: respond 400 to requests with bad Host values [puppet] - 10https://gerrit.wikimedia.org/r/428321 (owner: 10Ema) [12:44:37] RECOVERY - Check systemd state on mw1240 is OK: OK - running: The system is fully operational [12:45:06] RECOVERY - Nginx local proxy to apache on mw1240 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 7.810 second response time [12:45:17] RECOVERY - Apache HTTP on mw1240 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.063 second response time [12:45:46] RECOVERY - nutcracker port on mw1238 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [12:46:06] RECOVERY - Apache HTTP on mw1238 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 2.346 second response time [12:46:16] RECOVERY - nutcracker process on mw1238 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [12:46:46] RECOVERY - Apache HTTP on mw1239 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 6.239 second response time [12:46:56] RECOVERY - nutcracker process on mw1239 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [12:47:26] RECOVERY - Nginx local proxy to apache on mw1239 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.070 second response time [12:47:42] jouncebot, next [12:47:42] In 0 hour(s) and 12 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180423T1300) [12:48:57] RECOVERY - HHVM rendering on mw1238 is OK: HTTP OK: HTTP/1.1 200 OK - 78676 bytes in 0.153 second response time [12:49:36] RECOVERY - HHVM rendering on mw1239 is OK: HTTP OK: HTTP/1.1 200 OK - 78676 bytes in 0.147 second response time [12:53:43] 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup ms-be1040-1043 - https://phabricator.wikimedia.org/T191896#4149585 (10fgiunchedi) [12:54:06] RECOVERY - Check size of conntrack table on mw1288 is OK: OK: nf_conntrack 
is 0 % full [12:54:06] RECOVERY - MD RAID on mw1288 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [12:54:16] RECOVERY - Check whether ferm is active by checking the default input chain on mw1288 is OK: OK ferm input default policy is set [12:54:16] RECOVERY - Check whether ferm is active by checking the default input chain on mw1287 is OK: OK ferm input default policy is set [12:54:56] RECOVERY - DPKG on mw1287 is OK: All packages OK [12:54:56] RECOVERY - configured eth on mw1288 is OK: OK - interfaces up [12:54:56] RECOVERY - DPKG on mw1288 is OK: All packages OK [12:54:56] RECOVERY - configured eth on mw1287 is OK: OK - interfaces up [12:54:56] RECOVERY - High CPU load on API appserver on mw1287 is OK: OK - load average: 6.14, 7.45, 4.61 [12:54:57] RECOVERY - High CPU load on API appserver on mw1288 is OK: OK - load average: 10.63, 8.84, 4.96 [12:55:06] RECOVERY - Check size of conntrack table on mw1287 is OK: OK: nf_conntrack is 0 % full [12:55:06] RECOVERY - MD RAID on mw1287 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [12:56:06] RECOVERY - Disk space on mw1285 is OK: DISK OK [12:56:06] RECOVERY - dhclient process on mw1285 is OK: PROCS OK: 0 processes with command name dhclient [12:56:16] RECOVERY - High CPU load on API appserver on mw1285 is OK: OK - load average: 10.55, 8.51, 4.78 [12:56:17] RECOVERY - configured eth on mw1285 is OK: OK - interfaces up [12:56:17] RECOVERY - DPKG on mw1285 is OK: All packages OK [12:56:27] RECOVERY - Nginx local proxy to apache on mw1288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 3.938 second response time [12:56:36] RECOVERY - Check whether ferm is active by checking the default input chain on mw1285 is OK: OK ferm input default policy is set [12:56:46] RECOVERY - HHVM processes on mw1285 is OK: PROCS OK: 6 processes with command name hhvm [12:58:59] !log disabling puppet on several mysql hosts before deploying gerrit:427902 [12:59:03] !log Deploy schema change on dbstore1002 s6 - T191519 
T188299 T190148 [12:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:11] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [12:59:12] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [12:59:12] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for European Mid-day SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180423T1300). [13:00:05] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:09] I'm here [13:00:21] (03PS6) 10Jcrespo: mariadb: Create /run/mysqld on server start with tmpfiles.d [puppet] - 10https://gerrit.wikimedia.org/r/427902 [13:00:58] I can SWAT today [13:01:06] Urbanecm: I see you have been busy :D [13:01:20] zeljkof, I don't understand :) [13:01:34] anything special about any commit, or the usual dance, mwdebug then deploy? 
[13:01:48] (03CR) 10Jcrespo: [C: 032] mariadb: Create /run/mysqld on server start with tmpfiles.d [puppet] - 10https://gerrit.wikimedia.org/r/427902 (owner: 10Jcrespo) [13:01:49] RECOVERY - Check systemd state on mw1287 is OK: OK - running: The system is fully operational [13:01:49] RECOVERY - Check systemd state on mw1288 is OK: OK - running: The system is fully operational [13:01:52] Urbanecm: well, all 8 commits are from you :D [13:01:58] :D [13:02:08] RECOVERY - nutcracker process on mw1287 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [13:02:08] RECOVERY - nutcracker process on mw1288 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [13:02:21] 427956 should be deployed directly to prod [13:02:30] 427940 is usual [13:02:38] RECOVERY - Nginx local proxy to apache on mw1287 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.214 second response time [13:02:48] 428053 requires script (syntax is in Gerrit comments) [13:03:19] RECOVERY - nutcracker process on mw1285 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [13:03:38] RECOVERY - nutcracker port on mw1285 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [13:03:41] Generally, when a script is required, it is noted in gerrit comments zeljkof [13:03:51] Urbanecm: ok, reviewing, will ping you as needed [13:03:58] ack [13:04:08] RECOVERY - Nginx local proxy to apache on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.081 second response time [13:08:29] RECOVERY - puppet last run on mw1288 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:08:29] RECOVERY - puppet last run on mw1287 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:08:48] RECOVERY - puppet last run on mw1285 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:10:19] RECOVERY - Check the NTP synchronisation 
status of timesyncd on mw1287 is OK: OK: synced at Mon 2018-04-23 13:10:14 UTC. [13:10:19] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1288 is OK: OK: synced at Mon 2018-04-23 13:10:14 UTC. [13:11:26] zeljkof, is something happening? :) [13:11:52] Urbanecm: sorry, a bit slow with the setup today, reviewing [13:12:33] !log restarting es2003 to test gerrit:427902 [13:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:58] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427956 (https://phabricator.wikimedia.org/T192668) (owner: 10Urbanecm) [13:13:41] Urbanecm: to confirm, the first one, 427956, can not be tested at mwdebug, should be deployed to production? [13:13:51] 10Operations, 10DBA: Investigate dropping "edit_page_tracking" database table from Wikimedia wikis after archiving it - https://phabricator.wikimedia.org/T57385#4149677 (10Marostegui) @Nemo_bis ^ [13:13:56] zeljkof, Yes, exactly [13:14:24] (03Merged) 10jenkins-bot: Temp rate limit for arwiki due to mass vandalism [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427956 (https://phabricator.wikimedia.org/T192668) (owner: 10Urbanecm) [13:17:02] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:427956|Temp rate limit for arwiki due to mass vandalism (T192668)]] (duration: 01m 17s) [13:17:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:09] T192668: wgRateLimits in ar.wikipedia - https://phabricator.wikimedia.org/T192668 [13:17:21] Urbanecm: 427956 deployed, please check [13:17:36] (03PS3) 10Zfilipin: Add logos for gorwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427940 (https://phabricator.wikimedia.org/T192669) (owner: 10Urbanecm) [13:19:13] Will do, please do next patch in the meantime zeljkof. 
[13:19:29] (03CR) 10jenkins-bot: Temp rate limit for arwiki due to mass vandalism [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427956 (https://phabricator.wikimedia.org/T192668) (owner: 10Urbanecm) [13:19:42] Urbanecm: sure, reviewing [13:20:16] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427940 (https://phabricator.wikimedia.org/T192669) (owner: 10Urbanecm) [13:21:32] (03Merged) 10jenkins-bot: Add logos for gorwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427940 (https://phabricator.wikimedia.org/T192669) (owner: 10Urbanecm) [13:23:40] Urbanecm: 427940 is at mwdebug [13:23:45] ack [13:24:45] zeljkof, working, please deploy [13:25:44] (03CR) 10jenkins-bot: Add logos for gorwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427940 (https://phabricator.wikimedia.org/T192669) (owner: 10Urbanecm) [13:26:38] PROBLEM - Check systemd state on db1080 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:26:55] (03PS4) 10Zfilipin: gorwiki: add missing namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428053 (https://phabricator.wikimedia.org/T189109) (owner: 10MarcoAurelio) [13:27:23] !log zfilipin@tin Synchronized static/images/project-logos/: SWAT: [[gerrit:427940|Add logos for gorwiki (T192669)]] (duration: 01m 16s) [13:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:31] T192669: Add logos to gorwiki - https://phabricator.wikimedia.org/T192669 [13:28:08] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/tmpfiles.d/mysqld.conf] [13:28:43] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:427940|Add logos for gorwiki (T192669)]] (duration: 01m 14s) [13:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:27] Urbanecm: 427940 deployed [13:29:32] 10Operations, 10netops: Update BGP_sanitize_in filter - https://phabricator.wikimedia.org/T190317#4149725 (10faidon) I took a careful look at this -- it looks pretty good, but I'd suggest rolling it out slowly in phases just to be on the safe side. That could be separate phases for either the three different t... [13:29:42] zeljkof, thx [13:30:02] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428053 (https://phabricator.wikimedia.org/T189109) (owner: 10MarcoAurelio) [13:30:27] (03PS1) 10Jcrespo: mariadb: Fix tmpfiles.d configuration on mysql hosts [puppet] - 10https://gerrit.wikimedia.org/r/428334 [13:30:52] 10Operations, 10netops: Implement BGP graceful shutdown - https://phabricator.wikimedia.org/T190323#4149726 (10faidon) Easy enough, +1 :) Maybe Add a /* comment */ linking to the NLNOG filter guide? [13:30:53] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Fix tmpfiles.d configuration on mysql hosts [puppet] - 10https://gerrit.wikimedia.org/r/428334 (owner: 10Jcrespo) [13:31:32] (03Merged) 10jenkins-bot: gorwiki: add missing namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428053 (https://phabricator.wikimedia.org/T189109) (owner: 10MarcoAurelio) [13:31:47] RECOVERY - mediawiki-installation DSH group on mw1238 is OK: OK [13:31:47] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1285 is OK: OK: synced at Mon 2018-04-23 13:31:40 UTC. 
[13:32:19] (03CR) 10jenkins-bot: gorwiki: add missing namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428053 (https://phabricator.wikimedia.org/T189109) (owner: 10MarcoAurelio) [13:32:26] Urbanecm: 428053 is at mwdebug [13:33:14] (03PS2) 10Jcrespo: mariadb: Fix tmpfiles.d configuration on mysql hosts [puppet] - 10https://gerrit.wikimedia.org/r/428334 [13:33:27] zeljkof, working, please deploy and then run the script [13:33:39] Urbanecm: ok [13:33:54] (03CR) 10Jcrespo: [C: 032] mariadb: Fix tmpfiles.d configuration on mysql hosts [puppet] - 10https://gerrit.wikimedia.org/r/428334 (owner: 10Jcrespo) [13:34:17] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/tmpfiles.d/mysqld.conf] [13:35:13] (03PS3) 10Zfilipin: euwikisource: add missing $wgMetaNamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428055 (https://phabricator.wikimedia.org/T189465) (owner: 10MarcoAurelio) [13:35:19] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:428053|gorwiki: add missing namespaces (T189109)]] (duration: 01m 17s) [13:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:26] T189109: Create Wikipedia Gorontalo - https://phabricator.wikimedia.org/T189109 [13:36:21] Urbanecm: 428053 deployed, script executed [13:37:08] zeljkof, great, thanks. 
[13:37:36] (03CR) 10Zfilipin: [C: 032] "script output in T189109#4149738" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428053 (https://phabricator.wikimedia.org/T189109) (owner: 10MarcoAurelio) [13:38:38] RECOVERY - mediawiki-installation DSH group on mw1285 is OK: OK [13:38:45] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428055 (https://phabricator.wikimedia.org/T189465) (owner: 10MarcoAurelio) [13:39:18] PROBLEM - Check systemd state on db2075 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:39:58] (03CR) 10ArielGlenn: [C: 032] don't rely on grep trick to see if there's binary crap in xml files [dumps] - 10https://gerrit.wikimedia.org/r/428281 (owner: 10ArielGlenn) [13:40:00] (03Merged) 10jenkins-bot: euwikisource: add missing $wgMetaNamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428055 (https://phabricator.wikimedia.org/T189465) (owner: 10MarcoAurelio) [13:40:56] Urbanecm: 428055 is at mwdebug [13:41:00] ack [13:41:25] zeljkof, working, please deploy and run the script [13:41:46] Urbanecm: ok [13:42:08] (03PS5) 10Zfilipin: lfnwiki: add logo path and missing namespace names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428052 (https://phabricator.wikimedia.org/T183561) (owner: 10MarcoAurelio) [13:42:29] PROBLEM - Check systemd state on db1112 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[13:43:00] (03PS1) 10Ottomata: Install libjson-perl (JSON.pm) for ezachte [puppet] - 10https://gerrit.wikimedia.org/r/428337 [13:43:06] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:428055|euwikisource: add missing $wgMetaNamespace (T189465)]] (duration: 01m 16s) [13:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:13] T189465: Create Wikisource Basque - https://phabricator.wikimedia.org/T189465 [13:43:17] PROBLEM - Check systemd state on db2080 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:43:26] (03PS3) 10Elukey: role::druid::analytics::worker: upgrade Druid to 0.10 [puppet] - 10https://gerrit.wikimedia.org/r/355471 (https://phabricator.wikimedia.org/T164008) (owner: 10Ottomata) [13:44:32] (03CR) 10jenkins-bot: euwikisource: add missing $wgMetaNamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428055 (https://phabricator.wikimedia.org/T189465) (owner: 10MarcoAurelio) [13:44:56] (03CR) 10Ottomata: [C: 032] Install libjson-perl (JSON.pm) for ezachte [puppet] - 10https://gerrit.wikimedia.org/r/428337 (owner: 10Ottomata) [13:45:07] Urbanecm: 428055 deployed, scripts executed [13:45:13] thx [13:46:41] (03CR) 10Zfilipin: [C: 032] "script output at T189465#4149758" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428055 (https://phabricator.wikimedia.org/T189465) (owner: 10MarcoAurelio) [13:47:50] !log decommissioning Cassandra, restbase1010-c -- T189822 [13:48:17] RECOVERY - Check systemd state on db2080 is OK: OK - running: The system is fully operational [13:48:27] RECOVERY - Check systemd state on db2075 is OK: OK - running: The system is fully operational [13:48:31] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428052 (https://phabricator.wikimedia.org/T183561) (owner: 10MarcoAurelio) [13:49:22] Urbanecm: there will probably not be enough time for all 8 commits, any of the remaining 
has priority, or should I continue in the order? [13:49:28] RECOVERY - Check systemd state on db1112 is OK: OK - running: The system is fully operational [13:49:40] (03Merged) 10jenkins-bot: lfnwiki: add logo path and missing namespace names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428052 (https://phabricator.wikimedia.org/T183561) (owner: 10MarcoAurelio) [13:49:47] RECOVERY - Check systemd state on db1080 is OK: OK - running: The system is fully operational [13:50:32] zeljkof, the last item in the calendar (I mean the script) must be done [13:50:40] The patches are ordered by priority. [13:50:57] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/11002/" [puppet] - 10https://gerrit.wikimedia.org/r/355471 (https://phabricator.wikimedia.org/T164008) (owner: 10Ottomata) [13:51:03] (03CR) 10jenkins-bot: lfnwiki: add logo path and missing namespace names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428052 (https://phabricator.wikimedia.org/T183561) (owner: 10MarcoAurelio) [13:51:12] Urbanecm: is the script related to a patch/task? [13:51:26] Urbanecm: 428052 is at mwdebug [13:51:44] zeljkof, T189465 [13:51:57] BTW where's stashbot? [13:52:07] Here it is :) [13:52:14] Urbanecm: could you please add a note to calendar linking the script and the task? [13:52:20] urandom, your log message wasn't logged because stashbot wasn't here [13:52:26] zeljkof, ofc [13:53:31] !log decommissioning Cassandra, restbase1010-c -- T189822 [13:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:40] Urbanecm: thanks [13:53:43] urandom, yw [13:54:21] Urbanecm: just checking, did you see that 428052 is at mwdebug? [13:54:29] zeljkof, didn't see your message [13:55:07] zeljkof, working, please deploy [13:55:14] Urbanecm: ok [13:55:19] zeljkof: did you run namespaceDupes.php for lfnwiki as well? 
I don't see it in the task [13:55:35] !log reimage analytics1067 to Debian Stretch - T192557 [13:55:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:41] Hauskatze: did not deploy it yet, will do right now [13:55:41] T192557: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557 [13:56:26] 10Operations, 10Performance-Team, 10Availability (MediaWiki-MultiDC), 10User-Joe, 10User-fgiunchedi: Create a prometheus exporter for mcrouter - https://phabricator.wikimedia.org/T192763#4149158 (10fgiunchedi) I'll be helping with `mcrouter_exporter` packaging/setup/etc, I tried it and looks like it is d... [13:57:07] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:428052|lfnwiki: add logo path and missing namespace names (T183561)]] (duration: 01m 15s) [13:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:14] T183561: Create Wikipedia Lingua Franca Nova 2 - https://phabricator.wikimedia.org/T183561 [13:58:23] Hauskatze, lfnwiki wasn't deployed yet [13:58:30] (well, it is right now) [13:58:39] (so guess script is running) [13:59:20] (03CR) 10Zfilipin: [C: 032] "script output at T183561#4149802" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428052 (https://phabricator.wikimedia.org/T183561) (owner: 10MarcoAurelio) [14:00:00] Urbanecm, Hauskatze: 428052 deployed and script executed T183561#4149802 [14:00:17] thx [14:00:19] Urbanecm: I will run the last script now, and then close the window, since the time is up [14:00:31] (03PS5) 10Ottomata: Target kafka jmx exporters by profiles instead of roles [puppet] - 10https://gerrit.wikimedia.org/r/427672 [14:00:45] ktnx [14:01:09] (03CR) 10Ottomata: [C: 032] Target kafka jmx exporters by profiles instead of roles [puppet] - 10https://gerrit.wikimedia.org/r/427672 (owner: 10Ottomata) [14:01:34] gorwiki is going to be a pain when the MW main namespaces are deployed [14:02:05] 
Urbanecm: script output T189465#4149812 [14:02:06] T189465: Create Wikisource Basque - https://phabricator.wikimedia.org/T189465 [14:02:13] !log EU SWAT finished [14:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:34] Urbanecm: thanks for deploying with #releng, please reschedule remaining commits to another swat [14:02:55] Done already :) [14:04:22] no, I mean, the cherry-picked patch you scheduled for morning swat [14:04:41] (03PS1) 10Cmjohnson: Adding dhcpd file for ms-be104-43 [puppet] - 10https://gerrit.wikimedia.org/r/428339 (https://phabricator.wikimedia.org/T191896) [14:05:30] (03CR) 10Cmjohnson: [C: 032] Adding dhcpd file for ms-be104-43 [puppet] - 10https://gerrit.wikimedia.org/r/428339 (https://phabricator.wikimedia.org/T191896) (owner: 10Cmjohnson) [14:05:36] (03PS2) 10Cmjohnson: Adding dhcpd file for ms-be104-43 [puppet] - 10https://gerrit.wikimedia.org/r/428339 (https://phabricator.wikimedia.org/T191896) [14:06:09] 10Operations, 10Prometheus-metrics-monitoring, 10User-fgiunchedi: Upgrade mysqld_exporter to 0.10.0 - https://phabricator.wikimedia.org/T161296#4149818 (10jcrespo) This could be done massively right now. Missing hosts with 0.9.0 still (that are not set as spares, waiting for decommissioning): * db[2033-2034... [14:06:45] (03CR) 10Cmjohnson: [V: 032 C: 032] Adding dhcpd file for ms-be104-43 [puppet] - 10https://gerrit.wikimedia.org/r/428339 (https://phabricator.wikimedia.org/T191896) (owner: 10Cmjohnson) [14:07:07] Hauskatze, why do you expect pain? 
[14:07:20] Murphy's Law [14:07:54] (03PS2) 10Ottomata: Use --new.consumer for main-eqiad -> analytics MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/427406 (https://phabricator.wikimedia.org/T192387) [14:08:57] 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup ms-be1040-1043 - https://phabricator.wikimedia.org/T191896#4149836 (10Cmjohnson) [14:09:17] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [14:10:19] !log switching main -> analytics MirrorMaker to --new.consumer (temporarily stopping puppet on kafka101[234]) https://phabricator.wikimedia.org/T192387 [14:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:32] (03CR) 10Ottomata: [C: 032] Use --new.consumer for main-eqiad -> analytics MirrorMaker [puppet] - 10https://gerrit.wikimedia.org/r/427406 (https://phabricator.wikimedia.org/T192387) (owner: 10Ottomata) [14:11:17] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:15:54] (03CR) 10Andrew Bogott: "I'm not 100% sure that image scaling isn't handled locally on wikitech still... 
would like to test/understand this better before making th" [puppet] - 10https://gerrit.wikimedia.org/r/428298 (owner: 10Muehlenhoff) [14:22:09] (03PS1) 10Jcrespo: labsdb1007: Fix puppet by adding dummy value to new parameter [puppet] - 10https://gerrit.wikimedia.org/r/428340 [14:30:37] (03PS2) 10Ema: varnishxcache.mtail: add HTTP status info [puppet] - 10https://gerrit.wikimedia.org/r/427373 [14:31:02] [wmf.29] 428283 Add Gorontalo language file, phab:T189127 [14:31:02] T189127: Add Gorontalo language support to MediaWiki - https://phabricator.wikimedia.org/T189127 [14:31:04] this is an UBN [14:31:17] (03CR) 10Ema: [C: 032] varnishxcache.mtail: add HTTP status info [puppet] - 10https://gerrit.wikimedia.org/r/427373 (owner: 10Ema) [14:31:22] zeljkof: please extend windows when there is an UBN to deploy [14:31:38] (and it's a matter of minutes) [14:32:26] Dereckson: hm, I have asked Urbanecm if anything is urgent, as far as I remember, he said the commit are already sorted by priority [14:32:45] I did not check associated tasks one by one [14:32:54] * Dereckson nods [14:33:16] if there is a UBN it can be deployed outside a window, right? [14:33:28] *me said there's nothing urgent in the SWAT itself... [14:33:34] * Urbanecm cannot use /me [14:33:58] okay, let's cherry pick it too for wmf30, so at next train they still have the fix [14:34:01] and deploy it so [14:34:39] Dereckson: if it can wait until the next SWAT, add it to the top with a note it's UBN [14:34:40] Dereckson, why wmf30? I thought gorwiki is at wmf29 [14:34:57] Urbanecm: because when it will be upgraded at wmf30, the change won't be there anymore [14:35:03] it will be included in wmf31 I think [14:35:12] https://gerrit.wikimedia.org/r/#/c/428342/ the cherry pick [14:35:19] wmf30 can wait, sure [14:35:41] outch https://phabricator.wikimedia.org/T189127 [14:35:45] Dereckson, I'm confused. Should I do anything? 
[14:35:53] indeed, it's for 1.32-wmf.1 [14:36:22] ooops [14:36:42] We need to fix the release notes by the way, it was intended for 1.31. [14:36:53] But it will be released in MediaWiki 1.32. [14:37:09] Dereckson, can you explain to me what's happening and what should happen? :D [14:38:43] Yes: changes to MediaWiki core are picked weekly. As this is a recent change, and our train is lagging, it won't be available in the versions currently deployed (29/30) but will be at 1.32-wmf.1 (the one we'll get this Thursday for gorwiki if train is successful) [14:39:31] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: WDQS endpoint timeout - https://phabricator.wikimedia.org/T192759#4149950 (10Gehel) Queries in error at the time of the issue: https://logstash.wikimedia.org/goto/a84c11d438e757265d6d53d4cb833797... [14:40:13] Dereckson, so...https://gerrit.wikimedia.org/r/#/c/428283/ should be abandoned? [14:40:22] + every 6 months, a new MediaWiki version is created [14:40:39] https://www.mediawiki.org/wiki/Version_lifecycle [14:40:54] the gorwiki language change was at first expected for 1.31, but it will be for 1.32 [14:41:17] so if you wish, you can do a commit to remove the line from RELEASE-NOTES-1.31 stating we add gor language and add it to RELEASE-NOTES-1.32. 
[14:42:09] unless you plan to commit it right away, I'd not put the RELEASE-NOTES in the commit to avoid merge conflicts [14:42:15] thanks for noticing that Dereckson [14:42:22] (about the mw cherry pick) [14:43:05] We still need to deploy https://gerrit.wikimedia.org/r/#/c/428283/ to fix "now", and probably a cherry-pick to wmf30 if train doesn't jump a version (I don't know the intent there) [14:43:56] so I'd wait Zuul and Jenkins are done, then deploy https://gerrit.wikimedia.org/r/#/c/428342/ and https://gerrit.wikimedia.org/r/#/c/428283/ both together (testing with 428283 as it's the current version) [14:44:50] Hauskatze: merge conflicts for adding languages are less a probability than for the remaining changes I think [14:53:09] We're waiting https://integration.wikimedia.org/zuul/ [14:58:41] <_joe_> Dereckson: oh the old "waiting for jenkins"? [14:58:43] <_joe_> https://cdn.meme.am/cache/instances/folder46/65289046.jpg [14:59:24] jenkins should be faster [14:59:35] or we should appoint a necromancer [14:59:49] is it servers capacity, ram, processors? [15:00:03] Hauskatze: Jenkins is as fast as the slowest // tasks take time to complete [15:00:16] (+ some limited overhead) [15:00:36] but what takes time here is more phpunit to run the unit tests [15:00:38] then a magic ball to check the future would work [15:00:40] than Jenkins [15:00:57] yep, phpunit is the last test and usually the longest [15:01:29] It's not jenkins, it's the slaves. [15:01:47] PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is CRITICAL: cluster=cache_text site={codfw,ulsfo} https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:01:51] <_joe_> Hauskatze: as long as there is not some infrastructure-level bottleneck, it's up to developers to make tests efficient; that said, I strongly suspect this has "nodepool" written all over it [15:02:10] <_joe_> what does this alarm tell me ^^ ? [15:02:15] <_joe_> is it actionable? 
[15:02:20] if the test has -jessie most lily it's nodepool. [15:02:25] lily = likly [15:02:41] _joe_: I think hashar is working very hard to migrate all tests to docker [15:02:57] <_joe_> Hauskatze: inorite [15:03:08] that being said, idk wtf is nodepool, docker or jessie [15:03:27] <_joe_> Hauskatze: operations-puppet after a healthy diet and moving to docker got its CI to run in ~ 15 seconds, down from ~ 3 minutes before [15:03:38] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [15:03:47] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [15:04:07] _joe_: sounds great [15:04:18] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [15:05:05] (nothing on fatalmonitor) [15:05:08] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [15:05:11] ema: codfw backends seems not happy [15:05:14] <_joe_> uhm lemme check [15:05:47] RECOVERY - HTTP availability for Nginx -SSL terminators- on einsteinium is OK: (No output returned from plugin) 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:06:08] Hauskatze and Urbanecm > by the way, before the current alert, gor language file is on mwdebug1002 [15:06:18] <_joe_> the problem has already gone away [15:07:05] so, the good news is that the recently merged availability check works fine [15:07:27] (and is not lagging behind like the one based on varnish-aggregate-client-status-codes) [15:08:05] neat [15:08:35] <_joe_> I would suggest that availability checks should be based on our SLO for frontends [15:09:30] Dereckson: I don't see the namespaces on mwdebug1002 [15:09:35] me neither [15:09:38] not at least via recent changes [15:09:55] is it possible that you picked it to the wrong mw version? [15:09:56] https://gor.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces|namespacealiases don't show them [15:10:17] probably the l10n cache to rebuild [15:10:24] (so a full scap) [15:10:25] ah, that might be [15:10:29] yep [15:11:38] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [15:11:58] (03PS1) 10Ottomata: Fix MirrorMaker alert grafana dashboard url [puppet] - 10https://gerrit.wikimedia.org/r/428346 [15:12:27] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [15:13:07] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 
[15:13:24] (03CR) 10Ottomata: [C: 031] role::druid::analytics::worker: upgrade Druid to 0.10 [puppet] - 10https://gerrit.wikimedia.org/r/355471 (https://phabricator.wikimedia.org/T164008) (owner: 10Ottomata) [15:13:44] !log dereckson@tin Synchronized php-1.31.0-wmf.29/languages/messages/MessagesGor.php: Localisation for MediaWiki in Gorontalo (T189127) (duration: 01m 18s) [15:13:48] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [15:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:51] T189127: Add Gorontalo language support to MediaWiki - https://phabricator.wikimedia.org/T189127 [15:15:05] (03PS2) 10Ottomata: Fix MirrorMaker alert grafana dashboard url [puppet] - 10https://gerrit.wikimedia.org/r/428346 [15:16:48] (03CR) 10Ottomata: [C: 032] Fix MirrorMaker alert grafana dashboard url [puppet] - 10https://gerrit.wikimedia.org/r/428346 (owner: 10Ottomata) [15:20:52] !log dereckson@tin Synchronized php-1.31.0-wmf.30/languages/messages/MessagesGor.php: Localisation for MediaWiki in Gorontalo (T189127) (duration: 01m 16s) [15:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:58] T189127: Add Gorontalo language support to MediaWiki - https://phabricator.wikimedia.org/T189127 [15:23:18] !log dereckson@tin scap sync-l10n completed (1.31.0-wmf.29) (duration: 00m 46s) [15:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:34] or not [15:23:35] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review, 10Wikimedia-Incident: Icinga: page in case all MediaWiki are throwing 5xx - https://phabricator.wikimedia.org/T186069#4150128 (10fgiunchedi) Alerted today, real short-lived issue. 
Note that the alert is a single one even though its text can change... [15:23:49] filling a bug [15:25:30] (03CR) 10Giuseppe Lavagetto: [C: 032] scap-helm: add logging to SAL for install, upgrade [puppet] - 10https://gerrit.wikimedia.org/r/427337 (owner: 10Giuseppe Lavagetto) [15:25:36] (03PS3) 10Giuseppe Lavagetto: scap-helm: add logging to SAL for install, upgrade [puppet] - 10https://gerrit.wikimedia.org/r/427337 [15:26:48] (03PS1) 10Ottomata: Add prometheus based alerts for main-eqiad -> analytics eqiad [puppet] - 10https://gerrit.wikimedia.org/r/428350 (https://phabricator.wikimedia.org/T192387) [15:27:43] (03PS2) 10Ottomata: Add prometheus based alerts for main-eqiad -> analytics eqiad [puppet] - 10https://gerrit.wikimedia.org/r/428350 (https://phabricator.wikimedia.org/T192387) [15:28:23] !log dereckson@tin Started scap: Rebuild localisation cache to add Gorontalo (T189127)z [15:28:24] !log dereckson@tin scap aborted: Rebuild localisation cache to add Gorontalo (T189127)z (duration: 00m 01s) [15:28:26] !log dereckson@tin Started scap: Rebuild localisation cache to add Gorontalo (T189127) [15:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:29] T189127: Add Gorontalo language support to MediaWiki - https://phabricator.wikimedia.org/T189127 [15:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:50] (03CR) 10Ottomata: [C: 032] Add prometheus based alerts for main-eqiad -> analytics eqiad [puppet] - 10https://gerrit.wikimedia.org/r/428350 (https://phabricator.wikimedia.org/T192387) (owner: 10Ottomata) [15:29:29] Dereckson: https://gor.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&servedby=1&curtimestamp=1&responselanginfo=1&meta=siteinfo&siprop=namespaces does not show the namespaces yet on mwdebug nor on prod, still rebuilding? 
[15:29:48] 15:28:12 <+logmsgbot> !log dereckson@tin Started scap: Rebuild localisation cache to add Gorontalo (T189127) [15:29:51] yes [15:30:06] oko [15:30:08] *oki [15:34:19] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review, 10Wikimedia-Incident: Icinga: page in case all MediaWiki are throwing 5xx - https://phabricator.wikimedia.org/T186069#3932784 (10ema) >>! In T186069#4150128, @fgiunchedi wrote: > Alerted today, real short-lived issue. Note that the alert is a sing... [15:36:55] !log dereckson@tin Finished scap: Rebuild localisation cache to add Gorontalo (T189127) (duration: 08m 29s) [15:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:02] T189127: Add Gorontalo language support to MediaWiki - https://phabricator.wikimedia.org/T189127 [15:50:24] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4150253 (10Cmjohnson) [15:51:46] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4116638 (10Cmjohnson) @Marostegui @jynus All 8 of the db's are racked, cabled and idrac setup. Updated racktables. These should be ready for you later today/first thing tomorro... [15:52:28] (03PS1) 10Bstorm: Revert "wiki replicas: Depool labsdb1010 for MCR table additions" [puppet] - 10https://gerrit.wikimedia.org/r/428358 [15:52:34] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1116 - db1123 - https://phabricator.wikimedia.org/T191792#4150257 (10Marostegui) @Cmjohnson thanks a lot - RAID also set up? 
[15:53:58] !log Added slots, slot_roles, content and content_models to views on labsdb1010 [15:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:35] (03PS2) 10Marostegui: Revert "wiki replicas: Depool labsdb1010 for MCR table additions" [puppet] - 10https://gerrit.wikimedia.org/r/428358 (owner: 10Bstorm) [15:55:02] (03CR) 10Marostegui: [C: 032] Revert "wiki replicas: Depool labsdb1010 for MCR table additions" [puppet] - 10https://gerrit.wikimedia.org/r/428358 (owner: 10Bstorm) [15:55:32] !log Reload haproxy on dbproxy1010 to repool labsdb1010 [15:55:34] 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup ms-be1040-1043 - https://phabricator.wikimedia.org/T191896#4150295 (10Cmjohnson) @Marostegui yes raid is setup to raid 10 256K stripe [15:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:52] 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup ms-be1040-1043 - https://phabricator.wikimedia.org/T191896#4150310 (10Marostegui) >>! 
In T191896#4150295, @Cmjohnson wrote: > @Marostegui yes raid is setup to raid 10 256K stripe I guess this is for: T191792 :) [15:59:13] (PS1) Bstorm: wiki replicas: Depool labsdb1011 for MCR table additions [puppet] - https://gerrit.wikimedia.org/r/428361 (https://phabricator.wikimedia.org/T184446) [15:59:34] PROBLEM - High lag on wdqs2002 is CRITICAL: 3611 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [15:59:35] PROBLEM - High lag on wdqs2006 is CRITICAL: 3617 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [15:59:36] PROBLEM - High lag on wdqs2001 is CRITICAL: 3617 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [15:59:55] PROBLEM - High lag on wdqs2005 is CRITICAL: 3632 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:00:00] ^looking... [16:00:14] PROBLEM - High lag on wdqs2004 is CRITICAL: 3649 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:00:31] (PS1) Marostegui: db-eqiad.php: Depool db1096:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/428362 (https://phabricator.wikimedia.org/T190148) [16:00:41] Operations, ops-eqiad, Traffic, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4150330 (BBlack) Are we good for handoff to #Traffic for OS-level install/config now on lvs1016? [16:02:15] looks like an invalid update...
[16:02:29] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1096:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/428362 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui) [16:03:49] !log restarting wdqs-updater on all nodes [16:03:54] (Merged) jenkins-bot: db-eqiad.php: Depool db1096:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/428362 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui) [16:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:08] (CR) jenkins-bot: db-eqiad.php: Depool db1096:3316 [mediawiki-config] - https://gerrit.wikimedia.org/r/428362 (https://phabricator.wikimedia.org/T190148) (owner: Marostegui) [16:06:20] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1096:3316 for alter table (duration: 01m 16s) [16:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:28] Operations, monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4150380 (BBlack) +1, probably the conservative approach for now would be have puppet disable the systemd unit and remove the cron.daily file (on jessie as well? seems a waste if we expect it's not used). [16:06:51] (PS1) Jforrester: Don't try to set wgSiteSupportPage, ignored for a decade [mediawiki-config] - https://gerrit.wikimedia.org/r/428365 (https://phabricator.wikimedia.org/T192467) [16:06:59] !log Deploy schema change on db1096:3316 - T191519 T188299 T190148 [16:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:07] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [16:07:07] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [16:07:07] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [16:07:33] Dereckson: still there?
[16:08:14] * Dereckson nods [16:08:26] I'm going to reach #mediawiki-i18n, as it doesn't seem to work [16:09:32] Dereckson: yeah, it's not working; but I got to go so I'll not be able to test further just in case you looked for me [16:11:28] (PS1) Herron: certspotter: temporarily disable cron job [puppet] - https://gerrit.wikimedia.org/r/428367 [16:14:45] RECOVERY - High lag on wdqs2006 is OK: (C)3600 ge (W)1200 ge 1187 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:15:15] RECOVERY - High lag on wdqs2004 is OK: (C)3600 ge (W)1200 ge 1036 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:15:32] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.26 (duration: 03m 28s) [16:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:44] RECOVERY - High lag on wdqs2002 is OK: (C)3600 ge (W)1200 ge 1005 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:15:46] RECOVERY - High lag on wdqs2001 is OK: (C)3600 ge (W)1200 ge 1132 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:16:04] RECOVERY - High lag on wdqs2005 is OK: (C)3600 ge (W)1200 ge 970 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:16:28] Operations, monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4142726 (fgiunchedi) +1 to remove atop as a daemon/cron, possibly the package altogether too [16:32:02] (PS1) EBernhardson: Enable NUMA awareness in elasticsearch JVM [puppet] - https://gerrit.wikimedia.org/r/428372 (https://phabricator.wikimedia.org/T191236) [16:34:54] (PS3) Thcipriani: Pipeline: setup minikube in CI [puppet] - https://gerrit.wikimedia.org/r/428010 (https://phabricator.wikimedia.org/T188936) [16:38:50] Operations, Scap, Release-Engineering-Team (Kanban): mwscript
rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4150546 (thcipriani) using `Eval.Jit=1` does actually seem faster on deployment-tin, anyway ``` [thcipriani@deployment-tin ~]$ export PHP='hhvm -vEva... [16:40:13] Operations, monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4150550 (MoritzMuehlenhoff) I haven't used atop at all so far, +1 to either removing it entirely or dropping service/cron instead (but let's still report this to Debian) [16:40:35] Operations, Performance-Team, Patch-For-Review: Move coal from graphite machine(s) - https://phabricator.wikimedia.org/T159354#4150552 (Krinkle) [16:41:05] Operations, Performance-Team, monitoring, Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#4150555 (Krinkle) [16:41:08] Operations, Performance-Team, Patch-For-Review: Move coal from graphite machine(s) - https://phabricator.wikimedia.org/T159354#3065109 (Krinkle) stalled>Open [16:41:57] (PS2) EBernhardson: Enable NUMA awareness in elasticsearch JVM [puppet] - https://gerrit.wikimedia.org/r/428372 (https://phabricator.wikimedia.org/T191236) [16:44:02] (PS1) Gehel: osm: make post-replicate command optional [puppet] - https://gerrit.wikimedia.org/r/428378 [16:44:07] ebernhardson: nice --^ [16:44:46] (CR) jerkins-bot: [V: -1] osm: make post-replicate command optional [puppet] - https://gerrit.wikimedia.org/r/428378 (owner: Gehel) [16:46:04] (PS2) Gehel: osm: make post-replicate command optional [puppet] - https://gerrit.wikimedia.org/r/428378 [16:46:30] (CR) Gehel: "puppet compiler looks happy: https://puppet-compiler.wmflabs.org/compiler02/11004/" [puppet] - https://gerrit.wikimedia.org/r/428378 (owner: Gehel) [16:53:19] (CR) Herron: [C: 2] certspotter: temporarily disable cron job [puppet] - https://gerrit.wikimedia.org/r/428367 (owner:
Herron) [16:56:41] (PS1) Vgutierrez: install_server: Reimage lvs5002 as stretch [puppet] - https://gerrit.wikimedia.org/r/428380 (https://phabricator.wikimedia.org/T191897) [16:59:24] (CR) Volans: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/428378 (owner: Gehel) [16:59:56] (PS3) Gehel: osm: make post-replicate command optional [puppet] - https://gerrit.wikimedia.org/r/428378 [17:00:04] gehel: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180423T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. [17:00:33] jouncebot: I concur... [17:00:43] (CR) Gehel: [C: 2] osm: make post-replicate command optional [puppet] - https://gerrit.wikimedia.org/r/428378 (owner: Gehel) [17:02:11] (CR) Vgutierrez: [C: 2] install_server: Reimage lvs5002 as stretch [puppet] - https://gerrit.wikimedia.org/r/428380 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez) [17:02:16] (PS2) Vgutierrez: install_server: Reimage lvs5002 as stretch [puppet] - https://gerrit.wikimedia.org/r/428380 (https://phabricator.wikimedia.org/T191897) [17:02:33] (CR) Gehel: [C: -2] "This has been replaced by https://gerrit.wikimedia.org/r/#/c/428378/" [puppet] - https://gerrit.wikimedia.org/r/428340 (owner: Jcrespo) [17:02:42] !log Depool and reimage lvs5002 as stretch - T191897 [17:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:50] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897 [17:03:44] PROBLEM - HHVM jobrunner on mw1294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:04:45] (CR) Gehel: [C: 1] "LGTM. Merging requires an elasticsearch restart and some planning."
[puppet] - https://gerrit.wikimedia.org/r/428372 (https://phabricator.wikimedia.org/T191236) (owner: EBernhardson) [17:06:14] RECOVERY - puppet last run on labsdb1007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:06:30] train catchup update: I'm going to put wmf.30 on group1 before we do the morning swat window, then hopefully move to all wikis afterwards. [17:07:45] (PS3) Gehel: Enable NUMA awareness in elasticsearch JVM [puppet] - https://gerrit.wikimedia.org/r/428372 (https://phabricator.wikimedia.org/T191236) (owner: EBernhardson) [17:08:14] (CR) Gehel: [C: 2] "Merging already. We'll restart a few machines today, check, and schedule a full cluster restart." [puppet] - https://gerrit.wikimedia.org/r/428372 (https://phabricator.wikimedia.org/T191236) (owner: EBernhardson) [17:10:10] (PS1) Thcipriani: Group1 to 1.31.0-wmf.30 [mediawiki-config] - https://gerrit.wikimedia.org/r/428381 [17:11:07] (PS1) Muehlenhoff: Remove mediawiki::firejail [puppet] - https://gerrit.wikimedia.org/r/428382 [17:11:35] PROBLEM - HHVM jobrunner on mw1338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:12:23] (CR) Thcipriani: [C: 2] Group1 to 1.31.0-wmf.30 [mediawiki-config] - https://gerrit.wikimedia.org/r/428381 (owner: Thcipriani) [17:13:30] (Abandoned) Jcrespo: labsdb1007: Fix puppet by adding dummy value to new parameter [puppet] - https://gerrit.wikimedia.org/r/428340 (owner: Jcrespo) [17:13:34] RECOVERY - HHVM jobrunner on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.004 second response time [17:13:49] (Merged) jenkins-bot: Group1 to 1.31.0-wmf.30 [mediawiki-config] - https://gerrit.wikimedia.org/r/428381 (owner: Thcipriani) [17:15:33] !log thcipriani@tin rebuilt and synchronized wikiversions files: Group1 to 1.31.0-wmf.30 [17:15:44] RECOVERY - HHVM jobrunner on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.010 second response time [17:16:08] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:41] !log thcipriani@tin Synchronized php: Group1 to 1.31.0-wmf.30 (duration: 01m 16s) [17:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:12] (CR) jenkins-bot: Group1 to 1.31.0-wmf.30 [mediawiki-config] - https://gerrit.wikimedia.org/r/428381 (owner: Thcipriani) [17:22:34] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4150753 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs5002.eqsin.wmnet ``` The log can be found in `/var/lo... [17:24:57] !log pushing firewall block on cr1/2-codfw - T175361 [17:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:04] T175361: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361 [17:26:27] jouncebot: Hello. [17:31:12] PROBLEM - HHVM jobrunner on mw1296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:31:28] jouncebot: next [17:31:28] In 0 hour(s) and 28 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180423T1800) [17:32:02] RECOVERY - HHVM jobrunner on mw1296 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.004 second response time [17:35:44] !log pushing firewall block on cr1-eqdfw - T175361 [17:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:50] T175361: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361 [17:38:59] (PS1) Fdans: Puppetize cron job archiving old MaxMind files to stat1005 and HDFS [puppet] - https://gerrit.wikimedia.org/r/428390 [17:40:36] (PS2) Fdans: Puppetize cron job archiving old MaxMind files to stat1005 and HDFS [puppet] - https://gerrit.wikimedia.org/r/428390 [17:40:52] PROBLEM - HHVM jobrunner on mw1338 is CRITICAL: CRITICAL - Socket
timeout after 10 seconds [17:41:42] RECOVERY - HHVM jobrunner on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [17:43:16] (PS3) Fdans: Puppetize cron job archiving old MaxMind files to stat1005 and HDFS [puppet] - https://gerrit.wikimedia.org/r/428390 [17:45:25] (PS10) Volans: First working version [software/debmonitor] - https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504) [17:45:27] (PS9) Volans: Add CLI script to be installed in the target hosts [software/debmonitor] - https://gerrit.wikimedia.org/r/394990 (https://phabricator.wikimedia.org/T167504) [17:45:29] (PS11) Volans: Add basic test coverage [software/debmonitor] - https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504) [17:45:31] (PS7) Volans: Add login and LDAP support [software/debmonitor] - https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504) [17:45:33] (PS2) Volans: Add server side validation of client certificates [software/debmonitor] - https://gerrit.wikimedia.org/r/428302 (https://phabricator.wikimedia.org/T167504) [17:45:46] (CR) jerkins-bot: [V: -1] First working version [software/debmonitor] - https://gerrit.wikimedia.org/r/394620 (https://phabricator.wikimedia.org/T167504) (owner: Volans) [17:45:48] (CR) jerkins-bot: [V: -1] Add CLI script to be installed in the target hosts [software/debmonitor] - https://gerrit.wikimedia.org/r/394990 (https://phabricator.wikimedia.org/T167504) (owner: Volans) [17:45:50] (CR) jerkins-bot: [V: -1] Add basic test coverage [software/debmonitor] - https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504) (owner: Volans) [17:45:52] (CR) jerkins-bot: [V: -1] Add login and LDAP support [software/debmonitor] - https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504) (owner: Volans) [17:45:54] (CR) jerkins-bot: [V: -1]
Add server side validation of client certificates [software/debmonitor] - https://gerrit.wikimedia.org/r/428302 (https://phabricator.wikimedia.org/T167504) (owner: Volans) [17:50:43] Operations, monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4142726 (Dzahn) +1 to remove the daemon/cron, keeping the package itself. [17:51:33] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1962 bytes in 0.089 second response time [17:52:34] !log ariel@tin Started deploy [dumps/dumps@02a3e80]: fix up checks for truncated/binary output files [17:52:38] !log ariel@tin Finished deploy [dumps/dumps@02a3e80]: fix up checks for truncated/binary output files (duration: 00m 04s) [17:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:17] Operations, monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4142726 (Vgutierrez) +1 to remove the daemon/cron, keep the package iff strictly necessary for somebody [18:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Morning SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180423T1800). [18:00:05] RoanKattouw and Urbanecm: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:09] Here [18:00:15] I can SWAAT [18:00:16] *SWAT [18:00:40] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK [18:00:54] (PS2) Catrope: Enable internationalized maps on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/427943 [18:00:59] (CR) Catrope: [C: 2] Enable internationalized maps on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/427943 (owner: Catrope) [18:02:19] (Merged) jenkins-bot: Enable internationalized maps on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/427943 (owner: Catrope) [18:02:35] (CR) jenkins-bot: Enable internationalized maps on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/427943 (owner: Catrope) [18:03:50] PROBLEM - MD RAID on lvs5002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:04:03] ^^ that's me, reimage in process [18:06:50] RECOVERY - MD RAID on lvs5002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [18:08:20] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable internationalized maps on testwiki (duration: 01m 17s) [18:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:31] (PS2) Catrope: Enable WikiLove on sawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/428051 (https://phabricator.wikimedia.org/T192212) (owner: Urbanecm) [18:08:38] (CR) Catrope: [C: 2] Enable WikiLove on sawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/428051 (https://phabricator.wikimedia.org/T192212) (owner: Urbanecm) [18:10:09] (Merged) jenkins-bot: Enable WikiLove on sawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/428051 (https://phabricator.wikimedia.org/T192212) (owner: Urbanecm) [18:11:13] Urbanecm: WikiLove on sawiki is on mwdebug1002, please test [18:13:37] RoanKattouw, are you sure?
[18:13:44] Uh, no, you're right, oops [18:14:22] Urbanecm: Try now [18:15:26] (CR) jenkins-bot: Enable WikiLove on sawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/428051 (https://phabricator.wikimedia.org/T192212) (owner: Urbanecm) [18:16:31] RoanKattouw, working, please deploy [18:17:25] Operations, Mail, Patch-For-Review: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#4150930 (herron) Mail log activity has stopped on mx2001 and I'm unable to connect to mx2001:25 from a VPS system outside wikimedia (with confirmed working outbound tcp/25 connectivity).... [18:17:37] (PS2) Catrope: Change timezone for napwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/427941 (https://phabricator.wikimedia.org/T192568) (owner: Urbanecm) [18:17:41] (CR) Catrope: [C: 2] Change timezone for napwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/427941 (https://phabricator.wikimedia.org/T192568) (owner: Urbanecm) [18:18:27] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable WikiLove on sawiki (T192212) (duration: 01m 19s) [18:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:34] T192212: Wikilove extension for sawiki - https://phabricator.wikimedia.org/T192212 [18:18:43] (PS2) Herron: install_server: reinstall mx2001 with stretch [puppet] - https://gerrit.wikimedia.org/r/427710 (https://phabricator.wikimedia.org/T175361) [18:19:42] (Merged) jenkins-bot: Change timezone for napwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/427941 (https://phabricator.wikimedia.org/T192568) (owner: Urbanecm) [18:19:48] (CR) Herron: [C: 2] install_server: reinstall mx2001 with stretch [puppet] - https://gerrit.wikimedia.org/r/427710 (https://phabricator.wikimedia.org/T175361) (owner: Herron) [18:21:53] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch -
https://phabricator.wikimedia.org/T191897#4150935 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs5002.eqsin.wmnet'] ``` and were **ALL** successful. [18:22:15] (CR) jenkins-bot: Change timezone for napwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/427941 (https://phabricator.wikimedia.org/T192568) (owner: Urbanecm) [18:23:39] (PS1) Vgutierrez: pybal: Re-enable BGP in lvs5002 [puppet] - https://gerrit.wikimedia.org/r/428400 (https://phabricator.wikimedia.org/T191897) [18:24:10] PROBLEM - HHVM jobrunner on mw1338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:24:20] PROBLEM - HHVM jobrunner on mw1293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:24:55] (CR) Vgutierrez: [C: 2] pybal: Re-enable BGP in lvs5002 [puppet] - https://gerrit.wikimedia.org/r/428400 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez) [18:25:03] Urbanecm: "Change timezone for napwiki" is now on mwdebug1002 [18:25:08] ack [18:25:40] working, please deploy [18:26:00] RECOVERY - HHVM jobrunner on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 1.061 second response time [18:26:41] PROBLEM - HHVM jobrunner on mw1294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:27:11] RECOVERY - HHVM jobrunner on mw1293 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.003 second response time [18:28:27] !log Repool (Re-enable BGP) lvs5002 - T191897 [18:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:34] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897 [18:28:45] (PS1) Andrew Bogott: Labtest: try to standardize on a service name for the puppetmaster [puppet] - https://gerrit.wikimedia.org/r/428402 (https://phabricator.wikimedia.org/T181523) [18:29:49] (CR) Andrew Bogott: [C: 2] Labtest: try to standardize on a service name for the puppetmaster [puppet] - https://gerrit.wikimedia.org/r/428402
(https://phabricator.wikimedia.org/T181523) (owner: Andrew Bogott) [18:30:26] PROBLEM - HHVM jobrunner on mw1293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:30:31] (PS1) Vgutierrez: install_server: Reimage lvs5001 as stretch [puppet] - https://gerrit.wikimedia.org/r/428403 (https://phabricator.wikimedia.org/T191897) [18:30:35] RECOVERY - HHVM jobrunner on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.146 second response time [18:31:16] PROBLEM - HHVM jobrunner on mw1338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:31:16] RECOVERY - HHVM jobrunner on mw1293 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.035 second response time [18:31:32] (CR) Vgutierrez: [C: 2] install_server: Reimage lvs5001 as stretch [puppet] - https://gerrit.wikimedia.org/r/428403 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez) [18:31:35] PROBLEM - HHVM jobrunner on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:31:41] (PS2) Vgutierrez: install_server: Reimage lvs5001 as stretch [puppet] - https://gerrit.wikimedia.org/r/428403 (https://phabricator.wikimedia.org/T191897) [18:33:25] PROBLEM - HHVM jobrunner on mw1318 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:33:42] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Change timezone for napwiki (T192568) (duration: 01m 31s) [18:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:48] T192568: Changing time settings on nap.wiki - https://phabricator.wikimedia.org/T192568 [18:33:50] !log Depool lvs5001 - T191897 [18:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:56] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897 [18:34:15] RECOVERY - HHVM jobrunner on mw1318 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.012 second response time [18:34:51] Operations, Ops-Access-Requests, User-Urbanecm: Requesting
access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830#4150980 (Urbanecm) [18:35:45] RECOVERY - HHVM jobrunner on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 9.252 second response time [18:35:46] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4150992 (Vgutierrez) [18:38:16] RECOVERY - HHVM jobrunner on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 6.940 second response time [18:38:37] Operations, Ops-Access-Requests, User-Urbanecm: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830#4150996 (Urbanecm) [18:38:48] (CR) Catrope: [C: 2] Revert "Revert "Add ruwikimedia to wikidataclient"" [mediawiki-config] - https://gerrit.wikimedia.org/r/428128 (https://phabricator.wikimedia.org/T188456) (owner: Urbanecm) [18:38:51] (PS3) Catrope: Revert "Revert "Add ruwikimedia to wikidataclient"" [mediawiki-config] - https://gerrit.wikimedia.org/r/428128 (https://phabricator.wikimedia.org/T188456) (owner: Urbanecm) [18:38:57] (CR) Catrope: [C: 2] Revert "Revert "Add ruwikimedia to wikidataclient"" [mediawiki-config] - https://gerrit.wikimedia.org/r/428128 (https://phabricator.wikimedia.org/T188456) (owner: Urbanecm) [18:40:13] Operations, Ops-Access-Requests, Patch-For-Review: add arlo and scott to parsoid releasers admin group - https://phabricator.wikimedia.org/T192684#4150998 (Dzahn) Can be merged during this week, just needs to wait for the "3 business day" period which started on Friday.
[18:40:15] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [18:40:24] (Merged) jenkins-bot: Revert "Revert "Add ruwikimedia to wikidataclient"" [mediawiki-config] - https://gerrit.wikimedia.org/r/428128 (https://phabricator.wikimedia.org/T188456) (owner: Urbanecm) [18:41:35] (CR) jenkins-bot: Revert "Revert "Add ruwikimedia to wikidataclient"" [mediawiki-config] - https://gerrit.wikimedia.org/r/428128 (https://phabricator.wikimedia.org/T188456) (owner: Urbanecm) [18:45:13] (CR) Ottomata: "Buncha comments, but I think this can be a lot simpler with cp --link (to copy as hardlinks)." (6 comments) [puppet] - https://gerrit.wikimedia.org/r/428390 (owner: Fdans) [18:45:24] (CR) Ottomata: "Let's talk in IRC>" [puppet] - https://gerrit.wikimedia.org/r/428390 (owner: Fdans) [18:45:55] Urbanecm: wikidataclient on ruwikimedia is now on mwdebug1002 [18:46:05] Sorry for the slowness, my SSH windows keep locking up after inactivity [18:46:58] RoanKattouw, working, please deploy [18:47:18] RoanKattouw, https://patrickmn.com/aside/how-to-keep-alive-ssh-sessions/ might help... [18:47:46] PROBLEM - HHVM jobrunner on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:48:02] Thanks!
This started happening recently so maybe the server config changed [18:48:36] RECOVERY - HHVM jobrunner on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.003 second response time [18:48:42] !log catrope@tin Synchronized dblists/wikidataclient.dblist: Add ruwikimedia to wikidataclient (T188456) (duration: 01m 15s) [18:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:48] T188456: Need to use the Wikidata Q for the WMRU site (Wikibase Client) - https://phabricator.wikimedia.org/T188456 [18:50:51] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4151032 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs5001.eqsin.wmnet ``` The log can be found in `/var/lo... [18:51:16] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [18:53:27] (PS1) Ottomata: Prep for profile::kafka::broker for main Kafka [puppet] - https://gerrit.wikimedia.org/r/428405 (https://phabricator.wikimedia.org/T192831) [18:54:06] (CR) jerkins-bot: [V: -1] Prep for profile::kafka::broker for main Kafka [puppet] - https://gerrit.wikimedia.org/r/428405 (https://phabricator.wikimedia.org/T192831) (owner: Ottomata) [18:55:41] (PS2) Ottomata: Prep for profile::kafka::broker for main Kafka [puppet] - https://gerrit.wikimedia.org/r/428405 (https://phabricator.wikimedia.org/T192831) [18:57:08] Operations, TemplateStyles, Traffic, Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4151056 (LGoto) [18:59:14] (PS3) Ottomata: Prep for profile::kafka::broker for main Kafka [puppet] -
https://gerrit.wikimedia.org/r/428405 (https://phabricator.wikimedia.org/T192831) [19:06:33] (CR) Ottomata: [C: 2] "no op in prod https://puppet-compiler.wmflabs.org/compiler02/11006/kafka-jumbo1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/428405 (https://phabricator.wikimedia.org/T192831) (owner: Ottomata) [19:09:04] what is: > Notice: Undefined variable: wmgWikibaseSiteGroup in /srv/mediawiki/wmf-config/Wikibase.php on line 263? [19:11:00] is.php/cs.php synced in wrong order? [19:12:22] I don't see any patches from the last window that weren't just IS.php [19:12:33] ^ RoanKattouw did you see that error while deploying? [19:12:37] Operations, Discovery, Wikidata, Wikidata-Query-Service, and 2 others: New WDQS clusters eqiad + codfw - https://phabricator.wikimedia.org/T182991#4151115 (Smalyshev) stalled>Resolved I think this is done now. [19:13:09] wmgWikibaseSiteGroup lacks a default => in IS.php, so maybe on some wikis it is actually not defined [19:13:28] just a guess [19:13:30] I didn't touch that either, I only touched wikibase-client.dblist [19:13:37] But maybe that did cause it [19:14:01] I wasn't diligent enough because I was fighting my own computer at the time, so I didn't look at logstash while deploying (sorry) [19:14:17] Are these notices coming from ruwikimedia by any chance [19:14:39] thcipriani: ^ [19:14:41] ;D [19:15:52] RoanKattouw: most probably yes.
ruwikimedia is not defined in initialisesettings.php [19:16:09] OK lemme fix that then [19:18:17] (PS1) Ottomata: Use realm conditional in role::kafka::main::broker so we can test [puppet] - https://gerrit.wikimedia.org/r/428410 (https://phabricator.wikimedia.org/T192831) [19:18:48] Hmm I actually have no idea what the correct value is [19:22:14] (CR) Ottomata: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/11007/" [puppet] - https://gerrit.wikimedia.org/r/428410 (https://phabricator.wikimedia.org/T192831) (owner: Ottomata) [19:26:31] RoanKattouw: thcipriani : I wrote a quick summary of the above findings on the task https://phabricator.wikimedia.org/T188456#4151178 [19:30:01] I mean, maybe "wikimedia"? But it's for langlinks so idk [19:31:04] should it be reverted until we know the answer? [19:31:21] (I'm unclear of the cleanliness of that revert/what's impacted) [19:31:39] PROBLEM - MD RAID on lvs5001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[19:33:25] ^^ that's me, lvs5001 is being reimaged [19:34:13] !log elukey@tin Started deploy [analytics/pivot/deploy@cb9ddee]: Fix 0.10.0 compatibility - T164008 [19:34:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:19] T164008: Update druid to 0.10 - https://phabricator.wikimedia.org/T164008 [19:34:29] !log elukey@tin Finished deploy [analytics/pivot/deploy@cb9ddee]: Fix 0.10.0 compatibility - T164008 (duration: 00m 17s) [19:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:40] RECOVERY - MD RAID on lvs5001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [19:35:22] (03PS1) 10Ottomata: Include proper profile::kafka::broker class in Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/428415 (https://phabricator.wikimedia.org/T192831) [19:35:40] PROBLEM - HHVM jobrunner on mw1338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:03] (03CR) 10Ottomata: [C: 032] Include proper profile::kafka::broker class in Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/428415 (https://phabricator.wikimedia.org/T192831) (owner: 10Ottomata) [19:36:25] (03PS1) 10Andrew Bogott: labtestpuppetmaster: define 'ca_server' [puppet] - 10https://gerrit.wikimedia.org/r/428416 (https://phabricator.wikimedia.org/T181523) [19:36:51] (03PS2) 10Andrew Bogott: labtestpuppetmaster: define 'ca_server' [puppet] - 10https://gerrit.wikimedia.org/r/428416 (https://phabricator.wikimedia.org/T181523) [19:37:39] RECOVERY - HHVM jobrunner on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.004 second response time [19:38:08] (03CR) 10Andrew Bogott: [C: 032] labtestpuppetmaster: define 'ca_server' [puppet] - 10https://gerrit.wikimedia.org/r/428416 (https://phabricator.wikimedia.org/T181523) (owner: 10Andrew Bogott) [19:39:10] greg-g: It's harmless log spam but still log spam. 
It'd probably be best to get a hold of a Wikidata person [19:39:20] PROBLEM - HHVM jobrunner on mw1296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:49] PROBLEM - HHVM jobrunner on mw1338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:40:10] PROBLEM - HHVM jobrunner on mw1307 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:40:39] RECOVERY - HHVM jobrunner on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.005 second response time [19:41:09] PROBLEM - HHVM jobrunner on mw1293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:43:19] RECOVERY - HHVM jobrunner on mw1296 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.004 second response time [19:43:21] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4151265 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs5001.eqsin.wmnet'] ``` and were **ALL** successful. [19:43:49] PROBLEM - HHVM jobrunner on mw1338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:43:49] PROBLEM - HHVM jobrunner on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:45:29] PROBLEM - HHVM jobrunner on mw1296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:45:40] RECOVERY - HHVM jobrunner on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 4.300 second response time [19:45:52] (03PS1) 10Vgutierrez: pybal: Re-enable BGP in lvs5001 [puppet] - 10https://gerrit.wikimedia.org/r/428418 (https://phabricator.wikimedia.org/T191897) [19:45:56] (03PS1) 10Ottomata: Only require java.security if declared (with Kafka ssl_enabled) [puppet] - 10https://gerrit.wikimedia.org/r/428419 (https://phabricator.wikimedia.org/T192831) [19:46:02] RoanKattouw: ack, hopefully antione's task update does that [19:46:09] PROBLEM - HHVM jobrunner on mw1294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:46:34] (03CR) 10Vgutierrez: [C: 032] pybal: Re-enable BGP in lvs5001 
[puppet] - 10https://gerrit.wikimedia.org/r/428418 (https://phabricator.wikimedia.org/T191897) (owner: 10Vgutierrez) [19:46:59] RECOVERY - HHVM jobrunner on mw1293 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [19:47:19] RECOVERY - HHVM jobrunner on mw1296 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.004 second response time [19:48:08] Dereckson: https://phabricator.wikimedia.org/T189127#4151289 [19:48:54] Hauskatze: what nikerabbit thinks about that? [19:49:09] RECOVERY - HHVM jobrunner on mw1307 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.019 second response time [19:49:12] I've not asked him, will do [19:49:25] !log Repool (Re-enable BGP) in lvs5001 - T191897 [19:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:32] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897 [19:49:49] PROBLEM - HHVM jobrunner on mw1318 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:49:49] PROBLEM - HHVM jobrunner on mw1338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:50:01] (03PS2) 10Ottomata: Only require java.security if declared (with Kafka ssl_enabled) [puppet] - 10https://gerrit.wikimedia.org/r/428419 (https://phabricator.wikimedia.org/T192831) [19:50:06] (03CR) 10Ottomata: [V: 032 C: 032] Only require java.security if declared (with Kafka ssl_enabled) [puppet] - 10https://gerrit.wikimedia.org/r/428419 (https://phabricator.wikimedia.org/T192831) (owner: 10Ottomata) [19:50:11] 10Operations, 10Mail, 10Patch-For-Review: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#4151293 (10herron) mx2001 has been reinstalled with Stretch, services are configured and test mail messages flow successfully. So far so good. Will plan to re-pool in the morning (PDT) tom... 
[19:50:49] RECOVERY - HHVM jobrunner on mw1318 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 8.988 second response time [19:51:10] PROBLEM - HHVM jobrunner on mw1293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:51:18] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4151310 (10Vgutierrez) [19:51:20] PROBLEM - HHVM jobrunner on mw1307 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:52:09] RECOVERY - HHVM jobrunner on mw1293 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 2.937 second response time [19:52:09] RECOVERY - HHVM jobrunner on mw1307 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.004 second response time [19:52:49] RECOVERY - HHVM jobrunner on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.003 second response time [19:52:49] PROBLEM - HHVM jobrunner on mw1318 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:53:29] PROBLEM - HHVM jobrunner on mw1296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:54:00] RECOVERY - HHVM jobrunner on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.015 second response time [19:54:19] PROBLEM - HHVM jobrunner on mw1293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:54:40] RECOVERY - HHVM jobrunner on mw1318 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.007 second response time [19:54:49] RECOVERY - HHVM jobrunner on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.003 second response time [19:55:09] RECOVERY - HHVM jobrunner on mw1293 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.003 second response time [19:55:29] RECOVERY - HHVM jobrunner on mw1296 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [19:56:19] (03PS1) 10Vgutierrez: hieradata: clean-up eqsin lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/428422 (https://phabricator.wikimedia.org/T191897) [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: 
#bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180423T2000). [20:00:25] no parsoid deploy today [20:01:06] (03PS1) 10Andrew Bogott: labtestpuppetmaster: further attempts to set the ca_server [puppet] - 10https://gerrit.wikimedia.org/r/428425 [20:01:55] (03CR) 10jerkins-bot: [V: 04-1] labtestpuppetmaster: further attempts to set the ca_server [puppet] - 10https://gerrit.wikimedia.org/r/428425 (owner: 10Andrew Bogott) [20:03:07] (03PS2) 10Andrew Bogott: labtestpuppetmaster: further attempts to set the ca_server [puppet] - 10https://gerrit.wikimedia.org/r/428425 (https://phabricator.wikimedia.org/T181523) [20:03:53] (03CR) 10Andrew Bogott: [C: 032] labtestpuppetmaster: further attempts to set the ca_server [puppet] - 10https://gerrit.wikimedia.org/r/428425 (https://phabricator.wikimedia.org/T181523) (owner: 10Andrew Bogott) [20:09:50] 10Operations, 10Availability (MediaWiki-MultiDC), 10Performance-Team (Radar): mcrouter production architecture - https://phabricator.wikimedia.org/T192771#4151402 (10Imarlier) [20:10:14] 10Operations, 10Availability (MediaWiki-MultiDC), 10Performance-Team (Radar), 10User-Joe, 10User-fgiunchedi: Create a prometheus exporter for mcrouter - https://phabricator.wikimedia.org/T192763#4151404 (10Imarlier) [20:14:59] !log Purged all languages messages from the cache, for gorwiki (rebuildmessages.php, T189127) [20:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:09] T189127: Add Gorontalo language support to MediaWiki - https://phabricator.wikimedia.org/T189127 [20:22:55] 10Operations, 10Availability (MediaWiki-MultiDC), 10Performance-Team (Radar): Deploy mcrouter to production as a wancache backend - https://phabricator.wikimedia.org/T192370#4151472 (10Imarlier) [20:26:36] !log mholloway-shell@tin Started deploy [mobileapps/deploy@5650605]: Update 
mobileapps to b011b2a [20:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:33] I'm going to roll forward all wikis to wmf.30 now (got caught in a meeting earlier) [20:29:08] !log ppchelko@tin Started deploy [restbase/deploy@228caf8]: Log the lack of the index entries [20:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:35] 10Operations, 10Puppet: puppetmaster puppet.conf refers to noexistent files - https://phabricator.wikimedia.org/T192848#4151493 (10Andrew) [20:31:58] (03PS1) 10Thcipriani: All wikis to 1.31.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428507 [20:32:32] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@5650605]: Update mobileapps to b011b2a (duration: 05m 56s) [20:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:28] (03PS1) 10Catrope: Enable non-static internationalized maps on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428508 [20:36:09] (03CR) 10Thcipriani: [C: 032] All wikis to 1.31.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428507 (owner: 10Thcipriani) [20:36:28] AndyRussG: ^ now apparently :P [20:36:36] 10Operations, 10ops-eqiad, 10Patch-For-Review: Rack and setup ms-be1040-1043 - https://phabricator.wikimedia.org/T191896#4151516 (10Cmjohnson) [20:37:20] 10Operations, 10Patch-For-Review: Rack and setup ms-be1040-1043 - https://phabricator.wikimedia.org/T191896#4120285 (10Cmjohnson) a:05Cmjohnson>03fgiunchedi @fgiunchedi These are all yours for implementation. 
[20:37:34] (03Merged) 10jenkins-bot: All wikis to 1.31.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428507 (owner: 10Thcipriani) [20:37:35] Reedy: no time like the present :) [20:37:50] heh [20:38:28] (03CR) 10jenkins-bot: All wikis to 1.31.0-wmf.30 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428507 (owner: 10Thcipriani) [20:40:22] !log thcipriani@tin rebuilt and synchronized wikiversions files: All wikis to 1.31.0-wmf.30 [20:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:41] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1965 bytes in 0.108 second response time [20:41:20] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [20:41:20] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [20:41:30] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received [20:42:16] mobileapps deploy earlier for the mobileapps alerts? 
[20:42:17] bearND: i wonder what this is about [20:42:20] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: puppetmaster hostcert and hostprivkey point to nonexistent files - https://phabricator.wikimedia.org/T179099#4151534 (10herron) [20:42:20] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [20:42:22] 10Operations, 10Puppet: puppetmaster puppet.conf refers to noexistent files - https://phabricator.wikimedia.org/T192848#4151533 (10herron) [20:42:36] bblack: nope, looking into it now, those are new [20:43:05] mdholloway: hmm, no idea yet [20:43:11] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [20:43:11] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [20:43:11] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [20:43:22] lol [20:43:27] !log ppchelko@tin Finished deploy [restbase/deploy@228caf8]: Log the lack of the index entries (duration: 14m 19s) [20:43:30] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [20:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:35] 10Operations, 10Puppet: puppetmaster puppet.conf refers to noexistent files - https://phabricator.wikimedia.org/T192848#4151493 (10herron) Hey @Andrew, there's some additional information about this in T179099 [20:43:35] !log ppchelko@tin Started deploy [restbase/deploy@228caf8]: Log the lack of the index entries, take 2 [20:43:36] (03PS1) 10Cmjohnson: Adding dhcp file db1118-1123 [puppet] - 10https://gerrit.wikimedia.org/r/428512 (https://phabricator.wikimedia.org/T191792) [20:43:40] maybe due to the RB deployment? 
[20:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:59] most-read depends on AQS [20:44:00] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [20:44:36] bearND: it looks like the internal request rate just spiked [20:45:27] (03PS1) 10Ottomata: Only require specific version of python-tornado in < stretch [puppet] - 10https://gerrit.wikimedia.org/r/428514 (https://phabricator.wikimedia.org/T192832) [20:45:33] specifically for page/definition [20:45:46] (03CR) 10Cmjohnson: [C: 032] Adding dhcp file db1118-1123 [puppet] - 10https://gerrit.wikimedia.org/r/428512 (https://phabricator.wikimedia.org/T191792) (owner: 10Cmjohnson) [20:45:56] (03CR) 10jerkins-bot: [V: 04-1] Only require specific version of python-tornado in < stretch [puppet] - 10https://gerrit.wikimedia.org/r/428514 (https://phabricator.wikimedia.org/T192832) (owner: 10Ottomata) [20:47:25] (03PS2) 10Ottomata: Only require specific version of python-tornado in < stretch [puppet] - 10https://gerrit.wikimedia.org/r/428514 (https://phabricator.wikimedia.org/T192832) [20:47:30] !log ppchelko@tin Finished deploy [restbase/deploy@228caf8]: Log the lack of the index entries, take 2 (duration: 03m 55s) [20:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:58] (03CR) 10Ottomata: [C: 032] "no op https://puppet-compiler.wmflabs.org/compiler02/11009/" [puppet] - 10https://gerrit.wikimedia.org/r/428514 (https://phabricator.wikimedia.org/T192832) (owner: 10Ottomata) [20:49:01] (03PS3) 10Ottomata: Only require specific version of python-tornado in < stretch [puppet] - 10https://gerrit.wikimedia.org/r/428514 (https://phabricator.wikimedia.org/T192832) [20:49:04] (03CR) 10Ottomata: [V: 032 C: 032] Only require specific 
version of python-tornado in < stretch [puppet] - 10https://gerrit.wikimedia.org/r/428514 (https://phabricator.wikimedia.org/T192832) (owner: 10Ottomata) [20:49:49] bearND: coming back down now [20:50:34] bearND: i guess it does look like a side-effect of the restbase deployment, given the concurrent restbase alert/recovery [20:50:36] (03PS1) 10Ottomata: Move eventbus in deployment-prep to new stretch server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428517 (https://phabricator.wikimedia.org/T192832) [20:50:51] bearND: i don't have any other theories, anyway [20:51:16] mdholloway: yes, i concur. deployments of both RB and MCS happened recently [20:51:38] (03CR) 10Jcrespo: "Can we pause the deleting for a while (disable the cron?)? I assume no new items are being added, so we are not in a hurry. However, we ar" [puppet] - 10https://gerrit.wikimedia.org/r/428297 (https://phabricator.wikimedia.org/T189596) (owner: 10Ladsgroup) [20:53:01] (03PS1) 10Ottomata: Move eventbus in deployment-prep to new stretch server [puppet] - 10https://gerrit.wikimedia.org/r/428519 (https://phabricator.wikimedia.org/T192832) [20:53:43] !log redirect text-lb.eqiad pings to ping1001 on cr1/2-eqiad (24h tests) - T190090 [20:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:49] T190090: Offload pings to dedicated server - https://phabricator.wikimedia.org/T190090 [20:55:07] (03CR) 10Ppchelko: [C: 031] Move eventbus in deployment-prep to new stretch server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428517 (https://phabricator.wikimedia.org/T192832) (owner: 10Ottomata) [20:55:24] (03CR) 10Ppchelko: [C: 031] Move eventbus in deployment-prep to new stretch server [puppet] - 10https://gerrit.wikimedia.org/r/428519 (https://phabricator.wikimedia.org/T192832) (owner: 10Ottomata) [20:58:46] (03CR) 10Ottomata: [C: 032] Move eventbus in deployment-prep to new stretch server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428517 
(https://phabricator.wikimedia.org/T192832) (owner: 10Ottomata) [20:58:49] (03CR) 10Ottomata: [C: 032] Move eventbus in deployment-prep to new stretch server [puppet] - 10https://gerrit.wikimedia.org/r/428519 (https://phabricator.wikimedia.org/T192832) (owner: 10Ottomata) [20:59:25] (03CR) 10jenkins-bot: Move eventbus in deployment-prep to new stretch server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428517 (https://phabricator.wikimedia.org/T192832) (owner: 10Ottomata) [20:59:35] thcipriani: did you finish with the wikiversion update? [21:00:04] bawolff and Reedy: How many deployers does it take to do Weekly Security deployment window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180423T2100). [21:01:00] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [21:02:19] (03PS1) 10Ottomata: Fix MirrorMaker alert dashboard alert url (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/428522 [21:02:40] Hauskatze: yes, wmf.30 should be everywhere now [21:02:43] (03CR) 10Ottomata: [V: 032 C: 032] Fix MirrorMaker alert dashboard alert url (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/428522 (owner: 10Ottomata) [21:03:04] thcipriani: thanks, do you know when the l10nupdate will be running? [21:04:25] we have https://phabricator.wikimedia.org/T189127#4147678 et seq. and not sure how this will be fixed [21:04:25] I believe it's cronned to run ~2:30 UTC [21:04:46] ok, will check tomorrow then if the script has fixed the issue [21:06:32] from the looks of it Dereckson ran a full scap since the language file was added; l10nupdate does mostly the same thing (along with some backports of l10n updates in master IIRC) [21:08:00] yes, I did [21:08:39] thcipriani: so, it won't fix it? 
[21:09:30] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors [21:11:29] I don't think it will [21:12:17] Dereckson: so we should continue to investigate I guess [21:18:09] I'm not available this evening [21:19:16] no probs [21:25:59] !log restart elasticsearch on elastic1025 to apply numa settings [21:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:27] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: WDQS endpoint timeout - https://phabricator.wikimedia.org/T192759#4151679 (10Smalyshev) Looking at the logs on wdq1003, I see a string of `java.lang.OutOfMemoryError: unable to create new native th... [21:36:29] !log restart elasticsearch on elastic1024 to apply numa settings [21:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:01] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Manitowoc, Wisconsin) is CRITICAL: Test Get summary for Manitowoc, Wisconsin returned the unexpected status 500 (expecting: 200): /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve all events on January 15) is CRITICAL: Test retrieve all events on January 15 returned the unexpected status 500 [21:48:01] /{domain}/v1/page/media/{title}{/revision}{/tid} (retrieve media items of en.wp Cat page via media route) is CRITICAL: Test retrieve media items of en.wp Cat page via media route returned the unexpected status 500 (expecting: 200) [21:48:01] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test 
Retrieve selected the events for Jan 01 returned the unexpected status 5 [21:48:01] : /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [uimage] [21:48:01] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [21:48:01] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [21:48:02] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page [21:48:02] {/revision} (Get references of a test page) is CRITICAL: Test Get references of a test page returned the unexpected status 500 (expecting: 200) [21:48:03] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 
(expecting: 200): /en.wikipedia.org/v1/feed [21:48:03] mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) is CRITICAL: Test Get extended metadata of a test page returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/references/{title}{/revision} (Get referen [21:48:04] is CRITICAL: Test Get references of a test page returned the unexpected status 500 (expecting: 200) [21:48:04] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/references/{title}{/revision} (Get references of a test page) is CRITICAL: Test Get references of a test page returned the unexpected status 500 (expecting: 200) [21:48:05] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) is CRITICAL: Test Get extended metadata of a test page returned the unexpected status 500 (expect [21:48:05] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (e [21:48:20] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: 
/{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Manitowoc, Wisconsin) is CRITICAL: Test Get summary for Manitowoc, Wisconsin returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/references/{title}{/revision}{/tid} (retrieve structured reference data for the Cat article on English Wikipedia) is CRITICAL: Test retrieve structured r [21:48:21] he Cat article on English Wikipedia returned the unexpected status 500 (expecting: 200): /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve all events on January 15) is CRITICAL: Test retrieve all events on January 15 returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/mobile-sections-lead/{title}{/revision}{/tid} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: T [21:48:21] ection of en.wp Altrincham page via mobile-sections-lead returned the unexpected status 500 (expecting: 200) [21:49:01] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the unexpected status 404 (expecting: 200) [21:49:01] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Manitowoc, Wisconsin) is CRITICAL: Test Get summary for Manitowoc, Wisconsin returned the unexpected status 500 (expecting: 200): /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve all events on January 15) is CRITICAL: Test retrieve all events on January 15 returned the unexpecte [21:49:01] ting: 200): /{domain}/v1/page/media/{title}{/revision}{/tid} (retrieve media items of en.wp Cat page via media route) is CRITICAL: Test retrieve media items of en.wp Cat page via media route returned the unexpected status 500 (expecting: 200) [21:49:16] duh [21:49:20] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: 
/en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (e [21:49:20] .wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) is CRITICAL: Test Get extended metadata of a test page returned the unexpected status 500 (expecting: 200) [21:49:20] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/feed [21:49:20] mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) is CRITICAL: Test Get extended metadata of a test page returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/references/{title}{/revision} (Get referen [21:49:20] is CRITICAL: Test Get references of a test page returned the unexpected status 500 (expecting: 200) [21:49:37] !log restart elasticsearch on elastic1028 to apply numa settings [21:49:40] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [21:49:40] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: 
/api/rest_v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [21:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:50] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [21:49:50] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [21:50:00] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [21:50:10] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve [21:50:10] or Video article on English Wikipedia returned the unexpected status 500 (expecting: 200) [21:50:20] PROBLEM - cxserver endpoints health on scb1003 is CRITICAL: /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the unexpected status 404 (expecting: 200) [21:51:01] PROBLEM - cxserver endpoints health on scb1004 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the 
unexpected status 404 (expecting: 200) [21:51:01] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [21:51:10] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [21:52:10] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Manitowoc, Wisconsin) is CRITICAL: Test Get summary for Manitowoc, Wisconsin returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/mobile-sections/{title}{/revision}{/tid} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections re [21:52:10] ed status 500 (expecting: 200): /{domain}/v1/page/metadata/{title}{/revision}{/tid} (retrieve extended metadata for Video article on English Wikipedia) is CRITICAL: Test retrieve extended metadata for Video article on English Wikipedia returned the unexpected status 500 (expecting: 200): /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve all events on January 15) is CRITICAL: Test retrieve all events on January 15 retur [21:52:10] status 500 (expecting: 200): /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve the selected anniversaries for January 15) is CRITICAL: Test retrieve the selected anniversaries [21:52:20] PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is CRITICAL: cluster=cache_text site={eqiad,esams,ulsfo} https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:52:51] PROBLEM - Text 
HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [21:53:01] PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the unexpected status 404 (expecting: 200): /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the unexpected status 404 (expecting: 200) [21:53:30] PROBLEM - HTTP availability for Varnish on einsteinium is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [21:53:41] can someone in a real tz look at those? I assume they indicate an actual problem (?) [21:54:17] apergos: i was indeed able to get an error from the public restbase api [21:54:26] (although not on repeated requests) [21:56:21] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the unexpected status 404 (expecting: 200): /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the unexpected status 404 (expecting: 200) [21:56:21] RECOVERY - cxserver endpoints health on scb1003 is OK: All endpoints are healthy [21:56:21] Reedy: ah great, thanks! 
[21:57:00] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy [21:57:10] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy [21:57:10] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [21:57:11] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [21:57:11] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [21:57:41] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [21:57:41] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy [21:57:51] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy [21:57:51] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [21:58:10] RECOVERY - cxserver endpoints health on scb1004 is OK: All endpoints are healthy [21:58:30] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy [21:59:10] RECOVERY - cxserver endpoints health on scb1002 is OK: All endpoints are healthy [21:59:20] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [21:59:20] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [21:59:23] urandom: ^ not sure what's going on there, but you might [21:59:30] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [22:00:10] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [22:00:20] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) is CRITICAL: Test Get extended metadata of a test page returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/references/{title}{/revision} (Get 
references of a test page) is CRITICAL: Test Get references of a test page returned the unexpected statu [22:00:20] 00) [22:00:20] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body (AttributeError: NoneType object has no attribute get) [22:00:31] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [22:01:10] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/page/{sourcelanguage}/{targetlanguage}/{title}{/revision} (Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the unexpected status 404 (expecting: 200) [22:01:20] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [22:01:20] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [22:01:20] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [22:01:20] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [22:01:20] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [22:01:21] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [22:01:21] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [22:01:22] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [22:01:22] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [22:01:23] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [22:01:23] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [22:01:31] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [22:01:53] 
greg-g: whoa, looking... [22:02:10] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy [22:04:30] RECOVERY - HTTP availability for Varnish on einsteinium is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [22:05:20] RECOVERY - HTTP availability for Nginx -SSL terminators- on einsteinium is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:11:10] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [22:12:11] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [22:12:27] !log restart elasticsearch on elastic1029 to apply numa settings [22:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:51] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [22:17:48] Operations, netops: Enabling graceful-switchover causes core dumps on cr1-codfw - https://phabricator.wikimedia.org/T191371#4151763 (ayounsi) Relevant KB entry: https://kb.juniper.net/InfoCenter/index?page=content&id=KB26616 JTAC's opinion on why it's working on some routers is that we're being "lucky".... 
[22:20:50] Operations, Puppet: puppetmaster puppet.conf refers to noexistent files - https://phabricator.wikimedia.org/T192848#4151767 (Andrew) ok, I'll close this as a duplicate. Looks like today's the day to fix the issue though... I read the task but still don't quite understand why we don't just not set those... [22:21:05] Operations, Puppet, Patch-For-Review, User-Joe: puppetmaster hostcert and hostprivkey point to nonexistent files - https://phabricator.wikimedia.org/T179099#4151770 (Andrew) [22:21:07] Operations, Puppet: puppetmaster puppet.conf refers to noexistent files - https://phabricator.wikimedia.org/T192848#4151772 (Andrew) [22:21:17] !log restart elasticsearch on elastic1030 to apply numa settings [22:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:55] Operations, ops-eqiad, Cassandra, hardware-requests, and 2 others: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#4151773 (Eevans) All 3 instances are now decommissioned. @fgiunchedi we're ready when you are to have the disks swapped and... [22:23:44] herron: I have to go in a few minutes, but sometime soon I would appreciate some help on https://phabricator.wikimedia.org/T181523. Arturo and I have each burned a million hours on it, and I'm SURE it's just some stupid config mistake. [22:24:04] The same VM base image works with another puppetmaster which is (to the best of my knowledge) configured the same way :( [22:33:47] twentyafterfour: want me to check the cron on phab1001? yes, let's delete it and run puppet [22:36:50] PROBLEM - puppet last run on labvirt1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/profile.d/field.sh] [22:37:50] !log phab1001 - deleting duplicate cronjob for public_taskdump.py (the one that did not output to /dev/null) (T188149) [22:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:57] T188149: Phlogiston reports don't have new data since mid-February - https://phabricator.wikimedia.org/T188149 [22:50:21] PROBLEM - HHVM rendering on mw2270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:51:20] RECOVERY - HHVM rendering on mw2270 is OK: HTTP OK: HTTP/1.1 200 OK - 79084 bytes in 0.324 second response time [22:56:24] !log disabling flapping VCP on asw1-eqsin - T192125 [22:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:30] T192125: asw1-eqsin vcp port flapping - https://phabricator.wikimedia.org/T192125 [22:58:56] (PS1) Dzahn: phabricator: make dumps server configurable [puppet] - https://gerrit.wikimedia.org/r/428540 (https://phabricator.wikimedia.org/T188149) [22:59:43] Operations, netops: asw1-eqsin vcp port flapping - https://phabricator.wikimedia.org/T192125#4151846 (ayounsi) Disabled with: `ayounsi@asw1-eqsin# run request virtual-chassis vc-port set interface member 1 vcp-255/0/25 disable` Should be re-enabled with: `# run request virtual-chassis vc-port set interf... [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for Evening SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180423T2300). [23:00:04] RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:00:17] I'll SWAT [23:00:21] I'm the only customer anyway [23:01:12] (CR) Paladox: [C: 1] phabricator: make dumps server configurable [puppet] - https://gerrit.wikimedia.org/r/428540 (https://phabricator.wikimedia.org/T188149) (owner: Dzahn) [23:01:37] (PS2) Catrope: Enable non-static internationalized maps on test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/428508 [23:01:41] (CR) Catrope: [C: 2] Enable non-static internationalized maps on test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/428508 (owner: Catrope) [23:01:50] RECOVERY - puppet last run on labvirt1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:02:10] (PS2) Dzahn: phabricator: make dumps server configurable, rsync to labstore1007 [puppet] - https://gerrit.wikimedia.org/r/428540 (https://phabricator.wikimedia.org/T188149) [23:03:09] (CR) jerkins-bot: [V: -1] Enable non-static internationalized maps on test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/428508 (owner: Catrope) [23:03:11] (CR) jerkins-bot: [V: -1] Enable non-static internationalized maps on test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/428508 (owner: Catrope) [23:04:49] (CR) Catrope: [C: 2] Enable non-static internationalized maps on test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/428508 (owner: Catrope) [23:06:05] (CR) jerkins-bot: [V: -1] Enable non-static internationalized maps on test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/428508 (owner: Catrope) [23:08:33] (CR) Catrope: [C: 2] Enable non-static internationalized maps on test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/428508 (owner: Catrope) [23:09:59] (CR) jerkins-bot: [V: -1] Enable non-static internationalized maps on test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/428508 (owner: Catrope) [23:11:15] !log restart elasticsearch on elastic1031 
to apply numa settings [23:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:09] !log changed AMS-IX peering mode to default (filter on radb+rpki) [23:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:39] (CR) Catrope: [C: 2] Enable non-static internationalized maps on test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/428508 (owner: Catrope) [23:29:07] Operations, Mail: move travel related aliases to OIT - https://phabricator.wikimedia.org/T127549#4151981 (bbogaert) @Dzahn Can we try this again? We have made the Google and LDAP groups. I think we just have to wait for the "previous cache callout" to expire, then the mail will flow. Thanks, Byron [23:29:56] (Merged) jenkins-bot: Enable non-static internationalized maps on test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/428508 (owner: Catrope) [23:30:11] (CR) jenkins-bot: Enable non-static internationalized maps on test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/428508 (owner: Catrope) [23:32:19] !log catrope@tin Synchronized php-1.31.0-wmf.30/extensions/Thanks/includes/EchoCoreThanksPresentationModel.php: Fix fatal error in Thanks notifications (T192711) (duration: 00m 58s) [23:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:25] T192711: PHP fatal error on Special:Notifications: Argument 1 passed to EchoEventPresentationModel::getTruncatedTitleText() must be an instance of Title, null given - https://phabricator.wikimedia.org/T192711 [23:35:20] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable non-static internationalized maps on test2wiki (duration: 00m 59s) [23:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log