[00:05:42] 06Operations, 10Page-Previews, 06Performance-Team, 06Reading-Web-Backlog, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#3200333 (10Nirzar) @Gilles Few things. Please, let's not use design principles out of context. >When you load an artic... [00:10:39] PROBLEM - tcpircbot_service_running on tegmen is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args tcpircbot.py [00:11:39] RECOVERY - tcpircbot_service_running on tegmen is OK: PROCS OK: 1 process with command name python, args tcpircbot.py [00:15:44] (03CR) 10EBernhardson: [C: 031] mwgrep: If --title is set, don't also require '*.js/.css' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/349351 (owner: 10Krinkle) [00:16:27] (03CR) 10EBernhardson: mwgrep: Add --etitle option (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/349352 (owner: 10Krinkle) [00:24:39] RECOVERY - DPKG on cp1008 is OK: All packages OK [00:48:59] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [00:50:49] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [00:57:59] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [00:58:49] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [00:58:59] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [01:06:32] so the phabricator traffic spikes keep happening: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=9&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1043&from=now-3h&to=now [01:07:01] not as bad as before but still pretty large. [01:10:15] (03CR) 10Krinkle: [C: 031] Force Labs to eqiad, since all the services are there. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/349349 (https://phabricator.wikimedia.org/T163514) (owner: 10Mattflaschen) [01:30:59] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [01:34:49] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [01:39:49] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [01:40:59] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [01:55:49] 06Operations, 10DBA, 10Phabricator: Intermittent DB connectivity problem on phabricator, needs investigation - https://phabricator.wikimedia.org/T163507#3200523 (10mmodell) [02:10:10] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:16:39] PROBLEM - tcpircbot_service_running on tegmen is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args tcpircbot.py [02:18:39] RECOVERY - tcpircbot_service_running on tegmen is OK: PROCS OK: 1 process with command name python, args tcpircbot.py [02:20:14] Hey - is MariaDB locked? [02:23:33] seems to be --read-only for some reason [02:37:52] foks: what's up. is there a real problem with the site? [02:38:09] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [02:39:33] I'm trying to make a change to centralauth and getting: "The MariaDB server is running with the --read-only option so it cannot execute this statement" [02:39:53] i see lots of edits on en.wiki though [02:40:05] hmm [02:40:06] change to centralauth? 
[02:40:13] yeah [02:40:24] I'm removing 2FA from an account [02:40:28] it also seems to be quiet on wikipedia channel [02:40:37] (03CR) 10Krinkle: mwgrep: If --title is set, don't also require '*.js/.css' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/349351 (owner: 10Krinkle) [02:40:46] (03PS1) 10Dzahn: nagios_common: add performance team members to contact group [puppet] - 10https://gerrit.wikimedia.org/r/349372 (https://phabricator.wikimedia.org/T163432) [02:40:48] I'm not sure what the issue is then. might be syntax [02:40:56] * foks investigates [02:41:01] foks: so not like "call the dba" kind of thing, right [02:41:06] oh, sorry. no [02:41:12] Just that error [02:41:19] ok, just making sure when i saw "readonly" [02:41:20] (at least, I hope not) [02:41:22] and db [02:41:25] heh [02:41:26] but seems ok [02:41:27] Sorry to panic you [02:41:45] heh, it's ok, i'll move on, i just happened to open my laptop to check really quick [02:41:48] no worries [02:42:07] i just wanted to upload that one thing really quick [02:42:42] (03PS2) 10Dzahn: nagios_common: add performance team members to contact group [puppet] - 10https://gerrit.wikimedia.org/r/349372 (https://phabricator.wikimedia.org/T163432) [02:43:51] (03CR) 10Dzahn: [C: 032] "this should give you the permissions requested on https://phabricator.wikimedia.org/T163432" [puppet] - 10https://gerrit.wikimedia.org/r/349372 (https://phabricator.wikimedia.org/T163432) (owner: 10Dzahn) [02:44:38] (03CR) 10Dzahn: [V: 032 C: 032] nagios_common: add performance team members to contact group [puppet] - 10https://gerrit.wikimedia.org/r/349372 (https://phabricator.wikimedia.org/T163432) (owner: 10Dzahn) [02:48:23] mutante: yeah, looking at the db tree it actually looks like the real issue may just be that --write is not using the right database for that so it's logging into a read only slave [02:48:37] not an emergency :) [02:50:48] 06Operations, 06Performance-Team, 13Patch-For-Review: Access request to Icinga 
control panel to acknowledge Performance alerts - https://phabricator.wikimedia.org/T163432#3200579 (10Dzahn) @gilles @aaron @Peter @Krinkle see change above, it should give you the requested permissions. feel free to try it. when... [02:51:28] Jamesofur: thanks for confirming that :) [02:51:39] and the change i wanted is done. so cu you later.. off [02:51:49] o7 [02:51:53] sleep well :) [02:52:00] thx, u2 [02:53:11] (03CR) 10Krinkle: [C: 04-1] mwgrep: Add --etitle option [puppet] - 10https://gerrit.wikimedia.org/r/349352 (owner: 10Krinkle) [03:24:46] PROBLEM - tcpircbot_service_running on tegmen is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args tcpircbot.py [03:25:46] RECOVERY - tcpircbot_service_running on tegmen is OK: PROCS OK: 1 process with command name python, args tcpircbot.py [03:50:46] PROBLEM - Check Varnish expiry mailbox lag on cp2002 is CRITICAL: CRITICAL: expiry mailbox lag is 600208 [03:55:36] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 288 bytes in 0.085 second response time [04:11:16] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=303.80 Read Requests/Sec=1270.90 Write Requests/Sec=0.60 KBytes Read/Sec=42342.40 KBytes_Written/Sec=46.80 [04:18:26] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=7.10 Read Requests/Sec=5.00 Write Requests/Sec=2.50 KBytes Read/Sec=20.40 KBytes_Written/Sec=62.40 [04:21:34] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=341.90 Read Requests/Sec=490.50 Write Requests/Sec=55.70 KBytes Read/Sec=7911.20 KBytes_Written/Sec=548.40 [04:22:24] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=19.60 Read Requests/Sec=0.50 Write Requests/Sec=0.80 KBytes Read/Sec=37.96 KBytes_Written/Sec=16.38 [04:35:44] PROBLEM - 
tcpircbot_service_running on tegmen is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args tcpircbot.py [04:36:44] RECOVERY - tcpircbot_service_running on tegmen is OK: PROCS OK: 1 process with command name python, args tcpircbot.py [04:40:44] RECOVERY - Check Varnish expiry mailbox lag on cp2002 is OK: OK: expiry mailbox lag is 30 [05:14:54] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1895 bytes in 0.101 second response time [05:19:54] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1882 bytes in 0.126 second response time [06:00:17] (03PS1) 10Marostegui: db-eqiad.php: Depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349377 (https://phabricator.wikimedia.org/T132416) [06:04:06] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349377 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui) [06:05:21] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349377 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui) [06:05:42] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1067 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349377 (https://phabricator.wikimedia.org/T132416) (owner: 10Marostegui) [06:09:23] !log Deploy alter table enwiki.revision db1067 - T132416 [06:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:33] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416 [06:16:03] 06Operations, 10Phabricator: Intermittent DB connectivity problem on phabricator, needs investigation - https://phabricator.wikimedia.org/T163507#3199829 (10Marostegui) As @mmodell says, the databases never crashed, they just had spikes as he posted in the graph 
and can also be seen on: https://grafana.wikimed... [06:21:03] !log Restart MySQL on db1065 for maintenance - T163351 [06:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:12] T163351: codfw API slaves overloaded during the 2017-04-19 codfw switch - https://phabricator.wikimedia.org/T163351 [06:59:16] (03PS1) 10Marostegui: db-eqiad.php: Move db1063 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349378 (https://phabricator.wikimedia.org/T163109) [07:03:34] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [07:03:44] PROBLEM - tcpircbot_service_running on tegmen is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args tcpircbot.py [07:04:55] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [07:05:44] RECOVERY - tcpircbot_service_running on tegmen is OK: PROCS OK: 1 process with command name python, args tcpircbot.py [07:06:44] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed [07:07:04] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused [07:07:04] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[07:09:54] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:10:44] PROBLEM - tcpircbot_service_running on tegmen is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args tcpircbot.py [07:11:44] RECOVERY - tcpircbot_service_running on tegmen is OK: PROCS OK: 1 process with command name python, args tcpircbot.py [07:15:01] <_joe_> the etcdmirror thing is my fault, I'm playing with it [07:15:11] <_joe_> but tcpircbot needs fixing :) [07:16:11] ah good thanks :D [07:18:04] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational [07:18:21] (03PS1) 10Giuseppe Lavagetto: Add separated SRV records for etcd to consume for conftool [dns] - 10https://gerrit.wikimedia.org/r/349380 (https://phabricator.wikimedia.org/T159687) [07:22:34] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:30:27] (03PS1) 10Marostegui: site.pp: Move db1063 to s5 [puppet] - 10https://gerrit.wikimedia.org/r/349381 (https://phabricator.wikimedia.org/T163109) [07:31:04] PROBLEM - Check Varnish expiry mailbox lag on cp2024 is CRITICAL: CRITICAL: expiry mailbox lag is 654458 [07:31:05] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Move db1063 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349378 (https://phabricator.wikimedia.org/T163109) (owner: 10Marostegui) [07:32:15] (03Merged) 10jenkins-bot: db-eqiad.php: Move db1063 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349378 (https://phabricator.wikimedia.org/T163109) (owner: 10Marostegui) [07:32:27] (03CR) 10jenkins-bot: db-eqiad.php: Move db1063 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349378 (https://phabricator.wikimedia.org/T163109) (owner: 10Marostegui) [07:39:21] 06Operations, 10DBA, 13Patch-For-Review, 05codfw-rollout: codfw API slaves overloaded during the 2017-04-19 codfw switch - https://phabricator.wikimedia.org/T163351#3200825 
(10jcrespo) [07:41:11] (03PS1) 10Marostegui: db-eqiad.php: Depool db1070 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349382 (https://phabricator.wikimedia.org/T163109) [07:42:35] (03PS2) 10Marostegui: db-eqiad.php: Depool db1071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349382 (https://phabricator.wikimedia.org/T163109) [07:43:27] !log installing further icu security updates [07:43:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:04] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349382 (https://phabricator.wikimedia.org/T163109) (owner: 10Marostegui) [07:45:27] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349382 (https://phabricator.wikimedia.org/T163109) (owner: 10Marostegui) [07:45:48] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349382 (https://phabricator.wikimedia.org/T163109) (owner: 10Marostegui) [07:47:53] !log Stop MySQL on db1071 and db1063 to reclone db1063 - T163109 [07:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:01] T163109: Reclone db1063 to become a slave in s5 - https://phabricator.wikimedia.org/T163109 [07:48:44] PROBLEM - nova-compute process on labvirt1009 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [07:49:44] RECOVERY - nova-compute process on labvirt1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [07:55:01] (03PS4) 10Marostegui: templates/wmnet: Switch dns master alias to codfw [dns] - 10https://gerrit.wikimedia.org/r/348440 (https://phabricator.wikimedia.org/T155099) [07:58:54] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[tzdata] [08:00:33] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/6192/" [puppet] - 10https://gerrit.wikimedia.org/r/349381 (https://phabricator.wikimedia.org/T163109) (owner: 10Marostegui) [08:00:54] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [08:01:36] !log rolling restart of hhvm on application servers in eqiad to pick up ICU security update [08:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:31] (03CR) 10Hashar: salt: fix grain-ensure comparison (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/348928 (https://phabricator.wikimedia.org/T146914) (owner: 10Hashar) [08:08:37] (03PS2) 10Hashar: salt: fix grain-ensure comparison [puppet] - 10https://gerrit.wikimedia.org/r/348928 (https://phabricator.wikimedia.org/T146914) [08:09:44] PROBLEM - tcpircbot_service_running on tegmen is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args tcpircbot.py [08:10:45] RECOVERY - tcpircbot_service_running on tegmen is OK: PROCS OK: 1 process with command name python, args tcpircbot.py [08:12:45] (03PS1) 10Giuseppe Lavagetto: hieradata: remove etcd_hosts, unused [puppet] - 10https://gerrit.wikimedia.org/r/349384 [08:12:46] (03PS1) 10Giuseppe Lavagetto: role::configcluster: reconfigure etcd replication [puppet] - 10https://gerrit.wikimedia.org/r/349385 (https://phabricator.wikimedia.org/T159687) [08:12:49] (03PS1) 10Giuseppe Lavagetto: etcd: make our rw clients use the new SRV record [puppet] - 10https://gerrit.wikimedia.org/r/349386 (https://phabricator.wikimedia.org/T159687) [08:13:06] (03CR) 10Marostegui: [C: 032] templates/wmnet: Switch dns master alias to codfw [dns] - 10https://gerrit.wikimedia.org/r/348440 (https://phabricator.wikimedia.org/T155099) (owner: 10Marostegui) [08:14:06] (03CR) 10Hashar: "Cherry picked again on beta cluster puppet master." 
[puppet] - 10https://gerrit.wikimedia.org/r/348928 (https://phabricator.wikimedia.org/T146914) (owner: 10Hashar) [08:16:24] 06Operations, 07Beta-Cluster-reproducible, 05MW-1.29-release (WMF-deploy-2017-04-25_(1.29.0-wmf.21)), 05MW-1.29-release-notes, and 2 others: firejail for mediawiki converter leaks to stderr: "Reading profile /etc/firejail/mediawiki-converters.profile" - https://phabricator.wikimedia.org/T158649#3200877 (10h... [08:16:36] (03PS1) 10Marostegui: s2,5.hosts: Move db1063 from s2 to s5 [software] - 10https://gerrit.wikimedia.org/r/349387 (https://phabricator.wikimedia.org/T163109) [08:20:39] (03CR) 10Marostegui: [C: 032] s2,5.hosts: Move db1063 from s2 to s5 [software] - 10https://gerrit.wikimedia.org/r/349387 (https://phabricator.wikimedia.org/T163109) (owner: 10Marostegui) [08:20:53] !log rolling restart of aqs (nodejs) on aqs* to pick up upgrades [08:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:34] (03Merged) 10jenkins-bot: s2,5.hosts: Move db1063 from s2 to s5 [software] - 10https://gerrit.wikimedia.org/r/349387 (https://phabricator.wikimedia.org/T163109) (owner: 10Marostegui) [08:22:12] (03PS3) 10Muehlenhoff: Load nf_conntrack via /etc/modules-load.d/ [puppet] - 10https://gerrit.wikimedia.org/r/349193 (https://phabricator.wikimedia.org/T136094) [08:28:47] (03PS1) 10Jcrespo: Depool db2055 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349388 (https://phabricator.wikimedia.org/T116557) [08:30:28] (03PS2) 10Jcrespo: MariaDB: Depool db2062 database server for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349388 (https://phabricator.wikimedia.org/T116557) [08:30:36] (03CR) 10Marostegui: "Title says db2055 but you are depooling db2062, is that intended?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/349388 (https://phabricator.wikimedia.org/T116557) (owner: 10Jcrespo) [08:30:44] <_joe_> logmsgbot: ping [08:30:52] (03CR) 10Marostegui: [C: 031] MariaDB: Depool db2062 database server for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349388 (https://phabricator.wikimedia.org/T116557) (owner: 10Jcrespo) [08:31:04] RECOVERY - Check Varnish expiry mailbox lag on cp2024 is OK: OK: expiry mailbox lag is 328 [08:31:11] (03CR) 10Jcrespo: [C: 032] MariaDB: Depool db2062 database server for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349388 (https://phabricator.wikimedia.org/T116557) (owner: 10Jcrespo) [08:32:10] (03Merged) 10jenkins-bot: MariaDB: Depool db2062 database server for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349388 (https://phabricator.wikimedia.org/T116557) (owner: 10Jcrespo) [08:32:22] (03CR) 10jenkins-bot: MariaDB: Depool db2062 database server for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349388 (https://phabricator.wikimedia.org/T116557) (owner: 10Jcrespo) [08:32:44] !log looking at tcpircbot (logmsgbot) problems at tegmen [08:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:47] test [08:36:04] (03CR) 10Muehlenhoff: [C: 032] Load nf_conntrack via /etc/modules-load.d/ [puppet] - 10https://gerrit.wikimedia.org/r/349193 (https://phabricator.wikimedia.org/T136094) (owner: 10Muehlenhoff) [08:37:43] (03CR) 10Marostegui: [C: 032] site.pp: Move db1063 to s5 [puppet] - 10https://gerrit.wikimedia.org/r/349381 (https://phabricator.wikimedia.org/T163109) (owner: 10Marostegui) [08:37:50] (03PS2) 10Marostegui: site.pp: Move db1063 to s5 [puppet] - 10https://gerrit.wikimedia.org/r/349381 (https://phabricator.wikimedia.org/T163109) [08:38:18] (03PS1) 10Ema: cache_upload: increase small object threshold to 1024B [puppet] - 10https://gerrit.wikimedia.org/r/349389 
(https://phabricator.wikimedia.org/T145661) [08:38:33] (03PS2) 10Giuseppe Lavagetto: Add separated SRV records for etcd to consume for conftool [dns] - 10https://gerrit.wikimedia.org/r/349380 (https://phabricator.wikimedia.org/T159687) [08:38:42] 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3200902 (10jcrespo) [08:38:47] 06Operations, 07Availability: Set databases as read-only or switchover to secondary datacenter - https://phabricator.wikimedia.org/T138810#3200898 (10jcrespo) 05Open>03Resolved a:03jcrespo [08:39:44] PROBLEM - puppet last run on db1036 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:40:14] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:41:04] PROBLEM - puppet last run on ms-be2012 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:41:14] PROBLEM - puppet last run on es2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:41:14] PROBLEM - puppet last run on ms-be1016 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:41:24] PROBLEM - puppet last run on ms-be2019 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:41:44] PROBLEM - puppet last run on db1069 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:41:44] PROBLEM - puppet last run on ms-be1012 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:41:50] Error: Could not set 'present' on ensure: No such file or directory - /etc/modules-load.d/conntrack.conf20170421-12138-ca8j01.lock at 27:/etc/puppet/modules/ferm/manifests/init.pp [08:41:54] moritzm: ^ [08:42:04] PROBLEM - puppet last run on ms-be2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:42:04] PROBLEM - puppet last run on db2048 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:42:24] PROBLEM - puppet last run on ms-be2005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:42:40] <_joe_> moritzm: ^^ [08:43:14] PROBLEM - puppet last run on ms-be2007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:43:34] PROBLEM - puppet last run on mw2246 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:43:44] PROBLEM - puppet last run on es2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:43:45] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:43:45] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:43:45] PROBLEM - puppet last run on californium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:44:04] PROBLEM - puppet last run on ms-be2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:44:24] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:44:41] reverting [08:44:44] PROBLEM - puppet last run on db1044 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:45:04] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:45:45] PROBLEM - puppet last run on db1031 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:46:04] PROBLEM - puppet last run on ms-be2006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:46:04] PROBLEM - puppet last run on ms-be2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:46:28] moritzm: there's apparently no /etc/modules-load.d/ on Ubuntu [08:46:44] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:46:44] PROBLEM - puppet last run on db1050 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:47:17] (03PS1) 10Muehlenhoff: Revert Load nf_conntrack via /etc/modules-load.d/ [puppet] - 10https://gerrit.wikimedia.org/r/349390 [08:47:31] ema: yeah, all my test systems were on jessie, I'm reverting and will fix that up in a followup commit [08:47:44] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:47:44] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:48:44] PROBLEM - puppet last run on db1034 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:48:54] (03CR) 10Muehlenhoff: [C: 032] Revert Load nf_conntrack via /etc/modules-load.d/ [puppet] - 10https://gerrit.wikimedia.org/r/349390 (owner: 10Muehlenhoff) [08:49:12] !log jynus@naos Synchronized wmf-config/db-codfw.php: Depool db2062 (duration: 01m 20s) [08:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:24] PROBLEM - puppet last run on db2029 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:49:44] (03PS1) 10Alexandros Kosiaris: role::tcpircbot: Pass ensure parameter to tcpircbot [puppet] - 10https://gerrit.wikimedia.org/r/349391 [08:50:04] PROBLEM - puppet last run on ms-be2009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:50:04] PROBLEM - puppet last run on db2038 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:50:44] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [08:50:45] PROBLEM - puppet last run on ocg1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:50:45] PROBLEM - puppet last run on db1026 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:50:45] PROBLEM - puppet last run on ms-be1021 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:50:45] PROBLEM - puppet last run on mw1169 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:50:54] PROBLEM - puppet last run on mw2152 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:51:04] RECOVERY - puppet last run on ms-be2009 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [08:51:04] RECOVERY - puppet last run on ms-be2001 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [08:51:04] RECOVERY - puppet last run on ms-be2004 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [08:51:21] PROBLEM - puppet last run on ms-be2010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:51:21] RECOVERY - puppet last run on ms-be2007 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [08:51:21] PROBLEM - puppet last run on db2023 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:51:21] PROBLEM - puppet last run on ms-be2014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:51:22] RECOVERY - puppet last run on ms-be2005 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [08:52:01] PROBLEM - puppet last run on db2041 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:52:01] RECOVERY - puppet last run on ms-be2002 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [08:52:01] RECOVERY - puppet last run on ms-be2003 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [08:52:01] RECOVERY - puppet last run on ms-be2006 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [08:52:01] PROBLEM - puppet last run on db2037 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:52:21] RECOVERY - puppet last run on ms-be2019 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [08:52:21] RECOVERY - puppet last run on ms-be2014 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [08:52:51] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/modules-load.d/conntrack.conf] [08:53:01] RECOVERY - puppet last run on ms-be2012 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [08:53:11] RECOVERY - puppet last run on ms-be2010 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [08:53:51] RECOVERY - puppet last run on ocg1002 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [08:53:51] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [08:53:51] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [08:54:11] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [08:54:21] RECOVERY - puppet last run on ms-be1016 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [08:54:21] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [08:54:52] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [08:54:52] RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [08:54:52] RECOVERY - puppet last run on ms-be1012 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [08:55:01] RECOVERY - puppet last run on es2004 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [08:55:41] RECOVERY - puppet last run on es2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:55:51] RECOVERY - puppet last run on mw1169 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [08:56:03] I am disabling notifications on db1063 [08:56:08] for the new checks 
[08:56:34] thanks jynus [08:56:48] i didn't realise that they will be dropped+created after the puppet change to move it to another shard [08:56:51] RECOVERY - puppet last run on db1044 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [08:56:51] RECOVERY - puppet last run on db1026 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [08:56:51] RECOVERY - puppet last run on db1050 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [08:57:01] RECOVERY - puppet last run on mw2152 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [08:57:29] marostegui, it normally doesn't, but the name changes sX -> s5 [08:57:43] and the name is there for easy searching and because multi-source [08:57:51] RECOVERY - puppet last run on db1069 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [08:57:51] RECOVERY - puppet last run on ms-be1021 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [08:57:51] RECOVERY - puppet last run on db1031 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [08:57:51] RECOVERY - puppet last run on db1036 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [08:57:51] RECOVERY - puppet last run on db1034 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [08:58:21] RECOVERY - puppet last run on db2023 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [08:58:22] RECOVERY - puppet last run on db2029 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [08:58:51] yeah, it is helpful actually [08:58:51] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [08:59:11] RECOVERY - puppet last run on db2038 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures 
[08:59:11] RECOVERY - puppet last run on db2037 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [08:59:41] RECOVERY - puppet last run on mw2246 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [09:00:01] RECOVERY - puppet last run on db2041 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [09:00:11] RECOVERY - puppet last run on db2048 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [09:00:51] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [09:01:06] Operations, Page-Previews, Performance-Team, Reading-Web-Backlog, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#3200916 (Gilles) I'm sorry to say, but what operating systems do, what other websites do, is completely irrelevant. You... [09:02:02] (PS2) Elukey: Refactor role::piwik in multiple profiles [puppet] - https://gerrit.wikimedia.org/r/348938 (https://phabricator.wikimedia.org/T159136) [09:02:57] (CR) jerkins-bot: [V: -1] Refactor role::piwik in multiple profiles [puppet] - https://gerrit.wikimedia.org/r/348938 (https://phabricator.wikimedia.org/T159136) (owner: Elukey) [09:03:39] (PS1) Muehlenhoff: Load nf_conntrack via /etc/modules-load.d/ [puppet] - https://gerrit.wikimedia.org/r/349392 (https://phabricator.wikimedia.org/T136094) [09:04:04] (CR) Alexandros Kosiaris: [C: 2] "https://puppet-compiler.wmflabs.org/6194/ says fine, merging" [puppet] - https://gerrit.wikimedia.org/r/349391 (owner: Alexandros Kosiaris) [09:04:09] (PS2) Alexandros Kosiaris: role::tcpircbot: Pass ensure parameter to tcpircbot [puppet] - https://gerrit.wikimedia.org/r/349391 [09:04:12] (CR) Alexandros Kosiaris: [V: 2 C: 2] role::tcpircbot: Pass ensure parameter to tcpircbot [puppet] - 
https://gerrit.wikimedia.org/r/349391 (owner: Alexandros Kosiaris) [09:10:09] Operations, Performance-Team, Patch-For-Review: Access request to Icinga control panel to acknowledge Performance alerts - https://phabricator.wikimedia.org/T163432#3200920 (Gilles) Open>Resolved a:Gilles Seems to work, I was able to post a comment, which was blocked for me before. Thanks! [09:10:20] !log stopping and upgrading/reconfiguring db2062 (depooled) T116557 [09:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:28] T116557: AFComputedVariable::compute query timeouts - https://phabricator.wikimedia.org/T116557 [09:15:52] (PS2) Ema: cache_upload: increase small object threshold to 1024B [puppet] - https://gerrit.wikimedia.org/r/349389 (https://phabricator.wikimedia.org/T145661) [09:17:29] (PS2) Muehlenhoff: Load nf_conntrack via /etc/modules-load.d/ [puppet] - https://gerrit.wikimedia.org/r/349392 (https://phabricator.wikimedia.org/T136094) [09:18:19] (CR) Ema: [C: 1] Load nf_conntrack via /etc/modules-load.d/ [puppet] - https://gerrit.wikimedia.org/r/349392 (https://phabricator.wikimedia.org/T136094) (owner: Muehlenhoff) [09:18:40] (PS3) Elukey: Refactor role::piwik in multiple profiles [puppet] - https://gerrit.wikimedia.org/r/348938 (https://phabricator.wikimedia.org/T159136) [09:20:14] !log rebooting etherpad1001 (running etherpad.wikimedia.org) for update to Linux 4.9 [09:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:51] (PS4) Elukey: Refactor role::piwik in multiple profiles [puppet] - https://gerrit.wikimedia.org/r/348938 (https://phabricator.wikimedia.org/T159136) [09:23:14] (PS3) Muehlenhoff: Load nf_conntrack via /etc/modules-load.d/ [puppet] - https://gerrit.wikimedia.org/r/349392 (https://phabricator.wikimedia.org/T136094) [09:27:44] Operations, MediaWiki-JobQueue, MediaWiki-JobRunner, Patch-For-Review, and 2 others: 
jobqueue is full of refreshlinks duplicates after the switchover. - https://phabricator.wikimedia.org/T163418#3200942 (Joe) @GWicke I have seen the same job being re-executed multiple times (after succeeding) whe... [09:28:46] (CR) Muehlenhoff: [C: 2] Load nf_conntrack via /etc/modules-load.d/ [puppet] - https://gerrit.wikimedia.org/r/349392 (https://phabricator.wikimedia.org/T136094) (owner: Muehlenhoff) [09:35:59] (PS1) Giuseppe Lavagetto: Do not restart HHVM on jobrunners [switchdc] - https://gerrit.wikimedia.org/r/349395 (https://phabricator.wikimedia.org/T163337) [09:43:13] (CR) Volans: [C: 1] "LGTM" [switchdc] - https://gerrit.wikimedia.org/r/349395 (https://phabricator.wikimedia.org/T163337) (owner: Giuseppe Lavagetto) [09:45:06] (Abandoned) Muehlenhoff: Load connection tracking sysctl values via a separate systemd unit [puppet] - https://gerrit.wikimedia.org/r/320197 (https://phabricator.wikimedia.org/T136094) (owner: Muehlenhoff) [09:51:25] (PS1) Muehlenhoff: Revert "Create a separate sysctl configuration for setting conntrack settings" [puppet] - https://gerrit.wikimedia.org/r/349396 [10:01:08] (CR) Alexandros Kosiaris: [C: 2] Fix sync-icinga-state cron presence/absence [puppet] - https://gerrit.wikimedia.org/r/349203 (owner: Alexandros Kosiaris) [10:01:18] (PS2) Alexandros Kosiaris: Fix sync-icinga-state cron presence/absence [puppet] - https://gerrit.wikimedia.org/r/349203 [10:01:24] (CR) Alexandros Kosiaris: [V: 2 C: 2] Fix sync-icinga-state cron presence/absence [puppet] - https://gerrit.wikimedia.org/r/349203 (owner: Alexandros Kosiaris) [10:02:04] (PS2) Alexandros Kosiaris: puppetmaster: Depool puppetmaster1002 [puppet] - https://gerrit.wikimedia.org/r/349164 (https://phabricator.wikimedia.org/T148506) [10:02:09] (CR) Alexandros Kosiaris: [V: 2 C: 2] puppetmaster: Depool puppetmaster1002 [puppet] - https://gerrit.wikimedia.org/r/349164 
(https://phabricator.wikimedia.org/T148506) (owner: Alexandros Kosiaris) [10:10:21] PROBLEM - puppet last run on db1087 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:10:51] PROBLEM - puppet last run on analytics1066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:10:51] PROBLEM - puppet last run on thumbor1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:10:51] PROBLEM - puppet last run on elastic1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:10:51] PROBLEM - puppet last run on db1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:10:51] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:11:12] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:11:21] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:11:41] PROBLEM - puppet last run on mw1182 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:11:51] PROBLEM - puppet last run on db1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:11:52] PROBLEM - puppet last run on elastic1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:11:52] PROBLEM - puppet last run on es1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:11:52] PROBLEM - puppet last run on es1012 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:11:52] PROBLEM - puppet last run on dbproxy1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:11:52] PROBLEM - puppet last run on mw1227 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:11:52] PROBLEM - puppet last run on elastic1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:11:56] that's interesting [10:12:11] PROBLEM - puppet last run on cp1050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:12:11] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:12:27] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Failed to submit 'replace facts' command for mw1182.eqiad.wmnet to PuppetDB at nitrogen.eqiad.wmnet:443: execution expired [10:12:44] maybe they were in the middle of a run? [10:12:51] PROBLEM - puppet last run on dbproxy1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:12:51] PROBLEM - puppet last run on es1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:13:31] PROBLEM - puppet last run on mc1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:13:43] probably related to me depooling puppetmaster1002 ? [10:13:51] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:13:51] PROBLEM - puppet last run on prometheus1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:13:51] PROBLEM - puppet last run on db1069 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:14:11] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:14:51] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:14:53] PROBLEM - puppet last run on elastic1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:14:53] PROBLEM - puppet last run on darmstadtium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:14:53] PROBLEM - puppet last run on etcd1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:14:53] PROBLEM - puppet last run on mw1283 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:14:53] PROBLEM - puppet last run on mw1269 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:14:53] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:14:53] PROBLEM - puppet last run on elastic1024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:14:53] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:14:54] PROBLEM - puppet last run on mc1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:15:30] (PS1) Alexandros Kosiaris: Revert "puppetmaster: Depool puppetmaster1002" [puppet] - https://gerrit.wikimedia.org/r/349399 [10:15:31] PROBLEM - puppet last run on mc1009 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:15:31] PROBLEM - puppet last run on mc1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:15:31] PROBLEM - puppet last run on mw1243 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:15:35] (CR) Alexandros Kosiaris: [V: 2 C: 2] Revert "puppetmaster: Depool puppetmaster1002" [puppet] - https://gerrit.wikimedia.org/r/349399 (owner: Alexandros Kosiaris) [10:15:51] PROBLEM - puppet last run on ms-be1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:15:51] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:15:51] PROBLEM - puppet last run on mw1197 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:16:01] PROBLEM - puppet last run on dbproxy1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:16:11] PROBLEM - puppet last run on mw1271 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:16:31] PROBLEM - puppet last run on mwdebug1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:16:41] RECOVERY - puppet last run on mw1182 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [10:16:51] PROBLEM - puppet last run on mw1254 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:16:51] PROBLEM - puppet last run on db1083 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:16:51] PROBLEM - puppet last run on argon is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:17:31] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:17:41] PROBLEM - puppet last run on mw1290 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:17:52] PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:17:52] PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:17:52] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:17:52] PROBLEM - puppet last run on rdb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:17:52] PROBLEM - puppet last run on mw1241 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:17:52] PROBLEM - puppet last run on labsdb1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:18:21] PROBLEM - puppet last run on labstore1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:18:31] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:18:31] PROBLEM - puppet last run on mw1204 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:18:31] PROBLEM - puppet last run on lvs1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:18:51] PROBLEM - puppet last run on db1022 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:18:51] PROBLEM - puppet last run on copper is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:18:51] PROBLEM - puppet last run on db1050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:18:51] PROBLEM - puppet last run on wtp1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:18:51] RECOVERY - puppet last run on es1017 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [10:18:52] PROBLEM - puppet last run on mw1266 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:18:52] PROBLEM - puppet last run on db1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:18:56] ok fixed [10:19:06] recoveries should start to flow in [10:19:11] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:19:11] PROBLEM - puppet last run on dbproxy1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:19:31] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:19:41] RECOVERY - puppet last run on mw1290 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [10:19:51] PROBLEM - puppet last run on mw1298 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:19:51] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:19:51] PROBLEM - puppet last run on elastic1041 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:19:51] PROBLEM - puppet last run on mw1285 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:19:51] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:20:31] PROBLEM - puppet last run on mw1194 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:20:51] PROBLEM - puppet last run on mw1294 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:20:51] PROBLEM - puppet last run on db1079 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:20:51] PROBLEM - puppet last run on mc1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:20:51] PROBLEM - puppet last run on wtp1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:20:51] PROBLEM - puppet last run on oresrdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:20:52] PROBLEM - puppet last run on mw1206 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:20:52] PROBLEM - puppet last run on cp1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:21:31] PROBLEM - puppet last run on mw1230 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:21:51] PROBLEM - puppet last run on cp1099 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:21:51] PROBLEM - puppet last run on restbase1009 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:21:51] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:21:51] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:22:01] PROBLEM - puppet last run on ms-be3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:22:01] PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:23:21] (PS1) Alexandros Kosiaris: Partially Revert "Revert "puppetmaster: Depool puppetmaster1002"" [puppet] - https://gerrit.wikimedia.org/r/349400 [10:23:47] (PS2) Alexandros Kosiaris: Partially Revert "Revert "puppetmaster: Depool puppetmaster1002"" [puppet] - https://gerrit.wikimedia.org/r/349400 [10:24:07] (CR) Alexandros Kosiaris: [C: 2] Partially Revert "Revert "puppetmaster: Depool puppetmaster1002"" [puppet] - https://gerrit.wikimedia.org/r/349400 (owner: Alexandros Kosiaris) [10:24:10] (CR) Alexandros Kosiaris: [V: 2 C: 2] Partially Revert "Revert "puppetmaster: Depool puppetmaster1002"" [puppet] - https://gerrit.wikimedia.org/r/349400 (owner: Alexandros Kosiaris) [10:24:28] (CR) Giuseppe Lavagetto: [C: 2] Do not restart HHVM on jobrunners [switchdc] - https://gerrit.wikimedia.org/r/349395 (https://phabricator.wikimedia.org/T163337) (owner: Giuseppe Lavagetto) [10:25:51] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [10:26:58] (PS1) Marostegui: db-eqiad.php: Repool db1071 [mediawiki-config] - https://gerrit.wikimedia.org/r/349402 (https://phabricator.wikimedia.org/T163109) [10:29:32] (CR) Hashar: "Indeed the run time goes from 5 seconds to 1minute 10 on my machine. 
That is most probably due to the global space being filled via boots" [mediawiki-config] - https://gerrit.wikimedia.org/r/349210 (owner: Hashar) [10:29:56] (PS1) Alexandros Kosiaris: Revert "Partially Revert "Revert "puppetmaster: Depool puppetmaster1002""" [puppet] - https://gerrit.wikimedia.org/r/349405 [10:30:21] (PS2) Alexandros Kosiaris: Revert "Partially Revert "Revert "puppetmaster: Depool puppetmaster1002""" [puppet] - https://gerrit.wikimedia.org/r/349405 [10:30:27] (CR) Alexandros Kosiaris: [V: 2 C: 2] Revert "Partially Revert "Revert "puppetmaster: Depool puppetmaster1002""" [puppet] - https://gerrit.wikimedia.org/r/349405 (owner: Alexandros Kosiaris) [10:31:46] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1071 [mediawiki-config] - https://gerrit.wikimedia.org/r/349402 (https://phabricator.wikimedia.org/T163109) (owner: Marostegui) [10:34:07] (Merged) jenkins-bot: db-eqiad.php: Repool db1071 [mediawiki-config] - https://gerrit.wikimedia.org/r/349402 (https://phabricator.wikimedia.org/T163109) (owner: Marostegui) [10:35:36] !log marostegui@naos Synchronized wmf-config/db-eqiad.php: Repool db1071 - T163109 (duration: 01m 20s) [10:35:44] (CR) jenkins-bot: db-eqiad.php: Repool db1071 [mediawiki-config] - https://gerrit.wikimedia.org/r/349402 (https://phabricator.wikimedia.org/T163109) (owner: Marostegui) [10:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:46] T163109: Reclone db1063 to become a slave in s5 - https://phabricator.wikimedia.org/T163109 [10:35:51] RECOVERY - puppet last run on elastic1030 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [10:36:21] RECOVERY - puppet last run on db1087 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [10:36:51] RECOVERY - puppet last run on analytics1066 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [10:36:51] RECOVERY - 
puppet last run on es1015 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [10:36:51] RECOVERY - puppet last run on thumbor1001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [10:36:51] RECOVERY - puppet last run on db1052 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [10:37:11] PROBLEM - puppet last run on multatuli is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:37:21] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [10:37:21] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [10:37:51] RECOVERY - puppet last run on elastic1032 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [10:37:51] RECOVERY - puppet last run on dbproxy1007 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [10:37:51] RECOVERY - puppet last run on mw1227 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [10:38:01] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [10:38:11] RECOVERY - puppet last run on cp1050 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [10:38:51] RECOVERY - puppet last run on db1036 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [10:38:51] RECOVERY - puppet last run on es1012 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [10:38:51] RECOVERY - puppet last run on elastic1019 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [10:38:52] RECOVERY - puppet last run on elastic1024 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [10:39:11] RECOVERY - puppet last run on cp3005 is OK: 
OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [10:39:51] PROBLEM - Check Varnish expiry mailbox lag on cp2002 is CRITICAL: CRITICAL: expiry mailbox lag is 685494 [10:39:51] RECOVERY - puppet last run on ms-be1012 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [10:39:51] RECOVERY - puppet last run on dbproxy1004 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [10:39:51] RECOVERY - puppet last run on elastic1036 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [10:39:51] RECOVERY - puppet last run on db1069 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [10:39:52] RECOVERY - puppet last run on mw1269 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [10:40:31] RECOVERY - puppet last run on mc1001 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [10:40:31] RECOVERY - puppet last run on mwdebug1001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [10:40:51] RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [10:40:51] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [10:40:51] RECOVERY - puppet last run on db1083 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [10:40:51] RECOVERY - puppet last run on prometheus1003 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [10:40:51] RECOVERY - puppet last run on darmstadtium is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [10:40:52] RECOVERY - puppet last run on mw1283 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [10:40:52] RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 33 seconds 
ago with 0 failures [10:41:11] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [10:41:31] RECOVERY - puppet last run on mc1009 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [10:41:31] RECOVERY - puppet last run on mc1007 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [10:41:51] RECOVERY - puppet last run on wtp1007 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [10:41:51] RECOVERY - puppet last run on etcd1005 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [10:41:51] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [10:41:51] RECOVERY - puppet last run on mc1004 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [10:41:52] RECOVERY - puppet last run on labsdb1011 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [10:42:01] RECOVERY - puppet last run on dbproxy1002 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [10:42:31] RECOVERY - puppet last run on mw1243 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [10:42:51] RECOVERY - puppet last run on mw1266 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [10:42:51] RECOVERY - puppet last run on rdb1003 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [10:42:51] RECOVERY - puppet last run on mw1197 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [10:42:58] (PS1) Muehlenhoff: Add debdeploy salt grains for failoid [puppet] - https://gerrit.wikimedia.org/r/349408 [10:43:11] RECOVERY - puppet last run on mw1271 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [10:43:11] RECOVERY - puppet last run on 
analytics1048 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [10:43:51] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [10:43:51] RECOVERY - puppet last run on db1022 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [10:43:51] RECOVERY - puppet last run on mw1254 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [10:43:51] RECOVERY - puppet last run on argon is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [10:43:52] RECOVERY - puppet last run on db1033 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [10:44:21] RECOVERY - puppet last run on labstore1005 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [10:44:31] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [10:44:31] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [10:44:31] RECOVERY - puppet last run on lvs1005 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [10:44:51] RECOVERY - puppet last run on db1050 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [10:44:51] RECOVERY - puppet last run on copper is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [10:44:51] RECOVERY - puppet last run on elastic1046 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [10:44:51] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [10:44:51] RECOVERY - puppet last run on mw1241 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [10:45:12] RECOVERY - puppet last run on dbproxy1011 is OK: OK: Puppet is currently enabled, last run 
42 seconds ago with 0 failures [10:45:31] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [10:45:31] RECOVERY - puppet last run on mw1204 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [10:45:51] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [10:45:51] RECOVERY - puppet last run on wtp1001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [10:45:51] RECOVERY - puppet last run on oresrdb1001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [10:45:51] RECOVERY - puppet last run on mw1285 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [10:45:51] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [10:45:52] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [10:46:31] RECOVERY - puppet last run on mw1194 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [10:46:35] (03PS2) 10Muehlenhoff: Add debdeploy salt grains for failoid [puppet] - 10https://gerrit.wikimedia.org/r/349408 [10:46:51] RECOVERY - puppet last run on db1079 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [10:46:51] RECOVERY - puppet last run on elastic1041 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [10:46:51] RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [10:47:01] RECOVERY - puppet last run on mw1208 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [10:47:31] RECOVERY - puppet last run on mw1230 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [10:47:51] RECOVERY - puppet last run on 
cp1099 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [10:47:51] RECOVERY - puppet last run on mc1012 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [10:47:51] RECOVERY - puppet last run on mw1298 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [10:47:52] RECOVERY - puppet last run on restbase1009 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [10:47:52] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [10:47:52] RECOVERY - puppet last run on cp1008 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [10:47:52] RECOVERY - puppet last run on mw1206 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [10:48:21] RECOVERY - puppet last run on multatuli is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [10:49:01] RECOVERY - puppet last run on ms-be3001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [10:50:23] (03CR) 10Muehlenhoff: [C: 032] Add debdeploy salt grains for failoid [puppet] - 10https://gerrit.wikimedia.org/r/349408 (owner: 10Muehlenhoff) [10:56:38] (03CR) 10Hashar: "And I have found the culprit. 
loggingTests::provideAvroSchemas() was setting globals directly which fill up the global scope which are the" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349210 (owner: 10Hashar) [10:57:00] (03PS2) 10Hashar: phpunit: automatically backup globals between tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349210 [11:00:53] (03PS1) 10Alexandros Kosiaris: Avoid double hiera lookups in puppetmaster::frontend [puppet] - 10https://gerrit.wikimedia.org/r/349409 [11:02:41] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.122 second response time [11:19:58] (03PS5) 10Elukey: Refactor role::piwik in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/348938 (https://phabricator.wikimedia.org/T159136) [11:20:45] -1 incoming from jenkins.. [11:21:01] (03CR) 10jerkins-bot: [V: 04-1] Refactor role::piwik in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/348938 (https://phabricator.wikimedia.org/T159136) (owner: 10Elukey) [11:22:37] (03PS6) 10Elukey: Refactor role::piwik in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/348938 (https://phabricator.wikimedia.org/T159136) [11:26:14] 06Operations: Reimage/rename codfw pool counters - https://phabricator.wikimedia.org/T149298#3201111 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [11:30:11] PROBLEM - Check Varnish expiry mailbox lag on cp2024 is CRITICAL: CRITICAL: expiry mailbox lag is 718746 [11:32:20] (03PS4) 10Jcrespo: Kill long running queries with stricter conditions [software] - 10https://gerrit.wikimedia.org/r/346559 (https://phabricator.wikimedia.org/T160984) [11:35:51] (03PS1) 10Hashar: phpunit: factor out logic to handle globals vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349413 [11:37:14] (03CR) 10Hashar: "loggingTests::providedAvroSchemas() was the low hanging fruit. 
cirrusTest suffers from a similar issue so I factored out the globals hand" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349210 (owner: 10Hashar) [11:39:06] (03PS5) 10Jcrespo: Kill long running queries with stricter conditions [software] - 10https://gerrit.wikimedia.org/r/346559 (https://phabricator.wikimedia.org/T160984) [11:40:02] (03PS7) 10Elukey: Refactor role::piwik in multiple profiles [puppet] - 10https://gerrit.wikimedia.org/r/348938 (https://phabricator.wikimedia.org/T159136) [11:42:34] (03CR) 10Elukey: "Andrew, this is the first draft of the refactoring.. Mind to check (whenever you have time) if it makes sense?" [puppet] - 10https://gerrit.wikimedia.org/r/348938 (https://phabricator.wikimedia.org/T159136) (owner: 10Elukey) [11:51:41] (03PS2) 10Alexandros Kosiaris: Avoid double hiera lookups in puppetmaster::frontend [puppet] - 10https://gerrit.wikimedia.org/r/349409 [11:51:48] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Avoid double hiera lookups in puppetmaster::frontend [puppet] - 10https://gerrit.wikimedia.org/r/349409 (owner: 10Alexandros Kosiaris) [11:53:42] (03CR) 10Marostegui: Kill long running queries with stricter conditions (031 comment) [software] - 10https://gerrit.wikimedia.org/r/346559 (https://phabricator.wikimedia.org/T160984) (owner: 10Jcrespo) [12:01:23] (03PS6) 10Jcrespo: Kill long running queries with stricter conditions [software] - 10https://gerrit.wikimedia.org/r/346559 (https://phabricator.wikimedia.org/T160984) [12:03:17] (03PS1) 10Marostegui: db-eqiad.php: Specify why db1080 is depooled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349414 [12:03:45] (03CR) 10Marostegui: [C: 031] Kill long running queries with stricter conditions [software] - 10https://gerrit.wikimedia.org/r/346559 (https://phabricator.wikimedia.org/T160984) (owner: 10Jcrespo) [12:04:50] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Specify why db1080 is depooled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349414 (owner: 10Marostegui) 
[12:06:01] (03Merged) 10jenkins-bot: db-eqiad.php: Specify why db1080 is depooled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349414 (owner: 10Marostegui) [12:06:09] (03CR) 10jenkins-bot: db-eqiad.php: Specify why db1080 is depooled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349414 (owner: 10Marostegui) [12:07:49] !log marostegui@naos Synchronized wmf-config/db-eqiad.php: Update db1080 depool reason (duration: 01m 18s) [12:07:53] !log Analyze revision, logging and page table on s1 db1080 - T116557 [12:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:04] T116557: AFComputedVariable::compute query timeouts - https://phabricator.wikimedia.org/T116557 [12:08:28] (03PS1) 10Volans: Puppet: run-puppet-agent, add --failed-only option [puppet] - 10https://gerrit.wikimedia.org/r/349416 [12:12:01] PROBLEM - Check Varnish expiry mailbox lag on cp2022 is CRITICAL: CRITICAL: expiry mailbox lag is 649297 [12:20:07] (03PS1) 10Alexandros Kosiaris: puppetmaster: Re-depool puppetmaster1002 [puppet] - 10https://gerrit.wikimedia.org/r/349419 (https://phabricator.wikimedia.org/T148506) [12:20:39] (03PS2) 10Volans: Puppet: run-puppet-agent, add --failed-only option [puppet] - 10https://gerrit.wikimedia.org/r/349416 [12:21:17] (03CR) 10Alexandros Kosiaris: [C: 031] Add separated SRV records for etcd to consume for conftool [dns] - 10https://gerrit.wikimedia.org/r/349380 (https://phabricator.wikimedia.org/T159687) (owner: 10Giuseppe Lavagetto) [12:22:00] (03CR) 10Alexandros Kosiaris: [C: 031] hieradata: remove etcd_hosts, unused [puppet] - 10https://gerrit.wikimedia.org/r/349384 (owner: 10Giuseppe Lavagetto) [12:30:28] (03CR) 10Volans: [C: 04-1] "I found some cases in which there are more status in the yaml file, need to improve the check" [puppet] - 10https://gerrit.wikimedia.org/r/349416 (owner: 10Volans) [12:32:39] (03CR) 10Alexandros 
Kosiaris: [C: 031] etcd: make our rw clients use the new SRV record [puppet] - 10https://gerrit.wikimedia.org/r/349386 (https://phabricator.wikimedia.org/T159687) (owner: 10Giuseppe Lavagetto) [12:35:16] (03PS3) 10Volans: Puppet: run-puppet-agent, add --failed-only option [puppet] - 10https://gerrit.wikimedia.org/r/349416 [12:37:00] (03CR) 10Alexandros Kosiaris: [C: 032] puppetmaster: Re-depool puppetmaster1002 [puppet] - 10https://gerrit.wikimedia.org/r/349419 (https://phabricator.wikimedia.org/T148506) (owner: 10Alexandros Kosiaris) [12:37:05] (03PS2) 10Alexandros Kosiaris: puppetmaster: Re-depool puppetmaster1002 [puppet] - 10https://gerrit.wikimedia.org/r/349419 (https://phabricator.wikimedia.org/T148506) [12:39:26] (03CR) 10ArielGlenn: [C: 031] "I did a grep through to see if we are using a check of a bool value anywhere else (answer: no). I also did tests to make sure it will stil" [puppet] - 10https://gerrit.wikimedia.org/r/348928 (https://phabricator.wikimedia.org/T146914) (owner: 10Hashar) [12:41:32] (03CR) 10Alexandros Kosiaris: [C: 032] "as expected in https://puppet-compiler.wmflabs.org/6200/puppetmaster1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/349419 (https://phabricator.wikimedia.org/T148506) (owner: 10Alexandros Kosiaris) [12:41:51] PROBLEM - Check Varnish expiry mailbox lag on cp2026 is CRITICAL: CRITICAL: expiry mailbox lag is 623957 [12:49:18] (03CR) 10Alexandros Kosiaris: [C: 04-1] role::configcluster: reconfigure etcd replication (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/349385 (https://phabricator.wikimedia.org/T159687) (owner: 10Giuseppe Lavagetto) [12:51:03] !log reboot puppetmaster1002 for kernel upgrade [12:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:17] 06Operations, 10netops: Interface errors on cr2-eqiad:xe-4/3/1 - https://phabricator.wikimedia.org/T163542#3201340 (10ayounsi) Telia ticket #00725688 opened. 
[13:13:50] (03PS1) 10Marostegui: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349430 (https://phabricator.wikimedia.org/T162539) [13:17:20] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349430 (https://phabricator.wikimedia.org/T162539) (owner: 10Marostegui) [13:18:28] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349430 (https://phabricator.wikimedia.org/T162539) (owner: 10Marostegui) [13:18:36] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349430 (https://phabricator.wikimedia.org/T162539) (owner: 10Marostegui) [13:20:16] (03Draft2) 10Zppix: Fixes EducationProgram user rights so that they can be assigned by sysops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349427 (https://phabricator.wikimedia.org/T163511) [13:20:23] (03PS3) 10Zppix: Fixes EducationProgram user rights so that they can be assigned by sysops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349427 (https://phabricator.wikimedia.org/T163511) [13:20:28] !log marostegui@naos Synchronized wmf-config/db-eqiad.php: Depool db1092 - T162539 T163548 (duration: 01m 18s) [13:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:39] T162539: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539 [13:20:40] T163548: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548 [13:22:15] (03PS4) 10Zppix: Fixes EducationProgram user rights so that they can be assigned by sysops & Bureaucrats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349427 (https://phabricator.wikimedia.org/T163511) [13:23:05] Urbanecm: ^^ [13:23:27] (03CR) 10BBlack: [C: 031] cache_upload: increase small object threshold to 
1024B [puppet] - 10https://gerrit.wikimedia.org/r/349389 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema) [13:24:49] (03PS3) 10Ema: cache_upload: increase small object threshold to 1024B [puppet] - 10https://gerrit.wikimedia.org/r/349389 (https://phabricator.wikimedia.org/T145661) [13:24:58] (03CR) 10Ema: [V: 032 C: 032] cache_upload: increase small object threshold to 1024B [puppet] - 10https://gerrit.wikimedia.org/r/349389 (https://phabricator.wikimedia.org/T145661) (owner: 10Ema) [13:25:05] `/win 58 [13:31:18] (03PS1) 10Rush: wmcs: better output from nfs-mount-manager [puppet] - 10https://gerrit.wikimedia.org/r/349433 (https://phabricator.wikimedia.org/T161898) [13:35:10] (03PS1) 10Alexandros Kosiaris: Switch oresrdb.svc.eqiad.wmnet to oresrdb1002 [dns] - 10https://gerrit.wikimedia.org/r/349434 (https://phabricator.wikimedia.org/T163326) [13:35:16] (03PS2) 10Rush: wmcs: better output from nfs-mount-manager [puppet] - 10https://gerrit.wikimedia.org/r/349433 (https://phabricator.wikimedia.org/T161898) [13:35:33] !log Deploy alter table on wikidatawiki.wb_terms on db1092 - T162539 T163548 [13:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:42] T162539: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539 [13:35:42] T163548: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548 [13:39:11] (03PS2) 10Alexandros Kosiaris: Switch oresrdb.svc.eqiad.wmnet to oresrdb1002 [dns] - 10https://gerrit.wikimedia.org/r/349434 (https://phabricator.wikimedia.org/T163326) [13:39:15] (03PS5) 10Zppix: Fixes EducationProgram user rights so that they can be assigned/removed by sysops & Bureaucrats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349427 (https://phabricator.wikimedia.org/T163511) [13:39:53] (03CR) 10Alexandros Kosiaris: [C: 032] Switch oresrdb.svc.eqiad.wmnet to 
oresrdb1002 [dns] - 10https://gerrit.wikimedia.org/r/349434 (https://phabricator.wikimedia.org/T163326) (owner: 10Alexandros Kosiaris) [13:43:25] (03CR) 10Rush: [C: 032] wmcs: better output from nfs-mount-manager [puppet] - 10https://gerrit.wikimedia.org/r/349433 (https://phabricator.wikimedia.org/T161898) (owner: 10Rush) [13:55:12] (03PS1) 10Marostegui: db-codfw.php: Increase API weight db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349436 [13:56:17] moritzm, o/ [13:56:27] Ready for Wikilabels DB maintenance :D [13:56:43] I'll be ready to get rid of our maintenance notice when the 10 seconds are up. [13:56:44] * moritzm too, will proceed at 14 [13:56:47] kk [13:59:01] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [14:00:08] !log installing postgresql bugfix update from jessie point release on labsdb1004 [14:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:38] halfak: upgraded, let me know if you run into any problems [14:01:42] moritzm, looks like we weren't robust to this. [14:01:50] Restarting the web service should solve the issue. [14:03:01] Yup. Looks OK [14:03:11] moritzm, thanks for working with us on this one :) [14:03:27] * halfak removes maintenance notice [14:03:41] halfak: ok, great. I'll add a note to our https://wikitech.wikimedia.org/wiki/Service_restarts page to coordinate all future restarts of that kind with you [14:04:23] Great :) [14:07:01] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:14:11] PROBLEM - Check Varnish expiry mailbox lag on cp2020 is CRITICAL: CRITICAL: expiry mailbox lag is 708336 [14:23:21] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
[14:26:24] !log ban objects with CT < 1024 on codfw cache_upload T145661 [14:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:32] T145661: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661 [14:26:52] chasemp: are the unmerged changes in puppet yours? [14:26:57] ^^^ [14:27:11] volans: not to my knowledge, I merged the only thing I've submitted today [14:27:15] heh, s/CT/CL/ in my SAL entry of course :) [14:27:27] volans: although I did it on puppetmaster1002 [14:27:30] possibly [14:27:46] yes, I wonder if stuck on 1001 for $reasons [14:28:11] ...are we meant to puppet-merge on both directly? [14:28:31] akosiaris: ^ quick q for you [14:29:21] 06Operations, 10ops-codfw, 10DBA: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3201510 (10Papaul) @Marostegui Anytime Monday at 9:30am works for me. [14:29:22] chasemp: I always merge on puppetmaster1001 and AFAIK 1002 was "depooled" earlier today [14:29:35] chasemp: puppetmaster1001 and puppetmaster2001 [14:29:39] the "frontends" [14:29:52] 1002 and 2002 are "backends" [14:29:59] always merge on the frontends [14:30:06] ok, that's on me. I was already on puppetmaster1002 and didn't realize that [14:30:17] updating [14:30:21] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [14:30:38] thanks akosiaris volans [14:30:42] yw [14:30:43] np [14:30:44] 06Operations, 10ops-codfw, 10DBA: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3201514 (10Marostegui) >>! In T163339#3201510, @Papaul wrote: > @Marostegui Anytime Monday at 9:30am works for me. Let's do it Monday at 9:30AM then. 
[14:32:40] !log Analyze revision, logging and page table on s1 db1067 - https://phabricator.wikimedia.org/T116557 [14:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:31] PROBLEM - Check Varnish expiry mailbox lag on cp2011 is CRITICAL: CRITICAL: expiry mailbox lag is 819774 [14:56:55] I am going to saturate db2062 (depooled) with connections [14:57:06] it is also on downtime [14:57:26] it is to test the query killer changes [15:02:48] (03CR) 10Marostegui: [C: 032] db-codfw.php: Increase API weight db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349436 (owner: 10Marostegui) [15:04:34] (03Merged) 10jenkins-bot: db-codfw.php: Increase API weight db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349436 (owner: 10Marostegui) [15:05:45] (03CR) 10jenkins-bot: db-codfw.php: Increase API weight db2071 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349436 (owner: 10Marostegui) [15:06:14] !log marostegui@naos Synchronized wmf-config/db-codfw.php: Increase weight db2071 (duration: 01m 17s) [15:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:24] yeah, that didn't work [15:07:07] it killed 2 connections out of 10000 [15:09:23] (03PS1) 10Milimetric: Tune cache for analytics.wikimedia.org data files [puppet] - 10https://gerrit.wikimedia.org/r/349447 (https://phabricator.wikimedia.org/T163338) [15:15:06] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3201613 (10akosiaris) [15:15:09] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: switchover oresrdb.svc.eqiad.wmnet from oresrdb1001 to oresrdb1002 - https://phabricator.wikimedia.org/T163326#3201611 (10akosiaris) 05Open>03Resolved This was done as well. There is an action item discovered during this which would be to automa... 
[15:18:46] (03PS1) 10Joal: Add unique devices in pivot config [puppet] - 10https://gerrit.wikimedia.org/r/349449 (https://phabricator.wikimedia.org/T159471) [15:19:27] elukey: if you have a minute --^ [15:23:12] sure :) [15:46:18] 06Operations, 10Traffic, 06Community-Liaisons (Jul-Sep 2017): Communicate this security change to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3201700 (10Whatamidoing-WMF) a:05BBlack>03None [16:02:43] (03CR) 10Elukey: [C: 032] Add unique devices in pivot config [puppet] - 10https://gerrit.wikimedia.org/r/349449 (https://phabricator.wikimedia.org/T159471) (owner: 10Joal) [16:03:54] (03PS2) 10Elukey: Tune cache for analytics.wikimedia.org data files [puppet] - 10https://gerrit.wikimedia.org/r/349447 (https://phabricator.wikimedia.org/T163338) (owner: 10Milimetric) [16:14:11] RECOVERY - Check Varnish expiry mailbox lag on cp2020 is OK: OK: expiry mailbox lag is 35 [16:14:42] (03CR) 10Elukey: [C: 032] Tune cache for analytics.wikimedia.org data files [puppet] - 10https://gerrit.wikimedia.org/r/349447 (https://phabricator.wikimedia.org/T163338) (owner: 10Milimetric) [16:15:18] (03CR) 10Framawiki: [C: 031] Remove all feeds added in T127176 from RSS whitelist for mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348782 (https://phabricator.wikimedia.org/T163217) (owner: 10Urbanecm) [16:20:51] 06Operations, 10Phabricator: Intermittent DB connectivity problem on phabricator, needs investigation - https://phabricator.wikimedia.org/T163507#3201808 (10greg) >>! In T163507#3200524, @mmodell wrote: > I've enabled rate limiting in phabricator and @Joe enabled `tw_reuse` on the sql proxy. Pretty sure that... 
[16:44:56] (03CR) 10Framawiki: "Linked bug: T163344" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348951 (owner: 10Jcrespo) [16:46:41] (03PS2) 10Giuseppe Lavagetto: hieradata: remove etcd_hosts, unused [puppet] - 10https://gerrit.wikimedia.org/r/349384 [16:46:50] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] hieradata: remove etcd_hosts, unused [puppet] - 10https://gerrit.wikimedia.org/r/349384 (owner: 10Giuseppe Lavagetto) [16:48:01] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [16:51:01] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [17:00:22] (03PS2) 10Giuseppe Lavagetto: role::configcluster: reconfigure etcd replication [puppet] - 10https://gerrit.wikimedia.org/r/349385 (https://phabricator.wikimedia.org/T159687) [17:01:01] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [17:03:25] (03CR) 10Giuseppe Lavagetto: [C: 032] "I think I solved all the issues pointed out by akosiaris, merging in good faith as I want to unblock puppet on conf2002." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/349385 (https://phabricator.wikimedia.org/T159687) (owner: 10Giuseppe Lavagetto) [17:07:01] RECOVERY - Etcd replication lag on conf2002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.006 second response time [17:07:41] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2002 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active [17:12:52] (03CR) 10Alexandros Kosiaris: "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/349385 (https://phabricator.wikimedia.org/T159687) (owner: 10Giuseppe Lavagetto) [17:13:01] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:14:02] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:15:45] (03PS1) 10Giuseppe Lavagetto: profile::conftool::master: make the git root dir a parameter [puppet] - 10https://gerrit.wikimedia.org/r/349468 (https://phabricator.wikimedia.org/T156924) [17:16:09] 06Operations, 10Scap (Scap3-MediaWiki-MVP): Depool proxies temporarily while scap is ongoing to avoid taxing those nodes - https://phabricator.wikimedia.org/T125629#3201925 (10mmodell) [17:16:13] 06Operations, 06Performance-Team, 07HHVM, 10Scap (Scap3-MediaWiki-MVP), 03releng-201617-q4: Make scap able to depool/repool servers via the conftool API - https://phabricator.wikimedia.org/T104352#3201926 (10mmodell) [17:16:25] 06Operations, 10Scap (Scap3-MediaWiki-MVP): Depool proxies temporarily while scap is ongoing to avoid taxing those nodes - https://phabricator.wikimedia.org/T125629#1993054 (10mmodell) [17:16:33] 06Operations, 06Performance-Team, 07HHVM, 10Scap (Scap3-MediaWiki-MVP), 03releng-201617-q4: Make scap able to depool/repool servers via the conftool API - https://phabricator.wikimedia.org/T104352#1414314 (10mmodell) [17:16:44] 06Operations, 10Scap (Scap3-MediaWiki-MVP): Depool proxies temporarily while scap is ongoing to avoid taxing those 
nodes - https://phabricator.wikimedia.org/T125629#1993054 (10mmodell) [17:16:52] 06Operations, 06Performance-Team, 07HHVM, 10Scap (Scap3-MediaWiki-MVP), 03releng-201617-q4: Make scap able to depool/repool servers via the conftool API - https://phabricator.wikimedia.org/T104352#1414314 (10mmodell) [17:18:02] (03CR) 10Giuseppe Lavagetto: "cherry-picked in beta, it unbreaks the puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/349468 (https://phabricator.wikimedia.org/T156924) (owner: 10Giuseppe Lavagetto) [17:20:16] (03CR) 10Reedy: [C: 04-1] "Spacing is wrong" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349427 (https://phabricator.wikimedia.org/T163511) (owner: 10Zppix) [17:30:01] (03PS2) 10Giuseppe Lavagetto: conftool: add mwconfig object type, define the first couple variables [puppet] - 10https://gerrit.wikimedia.org/r/347360 [17:30:06] (03CR) 10Giuseppe Lavagetto: conftool: add mwconfig object type, define the first couple variables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/347360 (owner: 10Giuseppe Lavagetto) [17:31:19] (03CR) 10Giuseppe Lavagetto: "cherry-picked in beta for T156924" [puppet] - 10https://gerrit.wikimedia.org/r/347360 (owner: 10Giuseppe Lavagetto) [17:43:06] 06Operations: cannot SSH into bast1001 - keep getting prompted for password - https://phabricator.wikimedia.org/T163568#3202018 (10Capt_Swing) [17:43:22] 06Operations: Education List serve filter problem - https://phabricator.wikimedia.org/T163569#3202030 (10NSaad) [17:43:33] 06Operations: cannot SSH into bast1001 - keep getting prompted for password - https://phabricator.wikimedia.org/T163568#3202056 (10Capt_Swing) [17:43:54] (03PS3) 10Giuseppe Lavagetto: conftool: add mwconfig object type, define the first couple variables [puppet] - 10https://gerrit.wikimedia.org/r/347360 [17:44:19] 06Operations: cannot SSH into bast1001 - keep getting prompted for password - https://phabricator.wikimedia.org/T163568#3202018 (10Capt_Swing) [18:01:16] 06Operations, 
10MediaWiki-Configuration, 06MediaWiki-Platform-Team, 06Performance-Team, and 9 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3202098 (10Joe) I managed to get a bare-bones working installation of conftool in deployment... [18:03:59] 06Operations, 10MediaWiki-Configuration, 06MediaWiki-Platform-Team, 06Performance-Team, and 9 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3202103 (10Joe) To summarize, I think it is possible to test `EtcdConfig` in beta at this po... [18:06:05] (03CR) 10Giuseppe Lavagetto: "If you want to test this in labs, refer to the instructions here: https://phabricator.wikimedia.org/T156924#3202098" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: 10Tim Starling) [18:11:11] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:12:02] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [18:13:09] 06Operations: cannot SSH into bast1001 - keep getting prompted for password - https://phabricator.wikimedia.org/T163568#3202115 (10Capt_Swing) Update I changed ``` IdentityFile ~/.ssh/id_rsa_bastion.pub ``` to ``` IdentityFile ~/.ssh/id_rsa_bastion ``` Still unable to connect. [18:13:11] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [18:15:01] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [1000.0] [18:15:14] many thumbs dont load properly [18:15:34] but it is not reproducible [18:16:59] probably the image backend [18:19:59] (03PS2) 1020after4: Add 90.231.10.86 to phabbanlist, this crawler is causing outages. 
[puppet] - 10https://gerrit.wikimedia.org/r/349342 [18:20:01] (03PS1) 1020after4: Include ::profile::conftool::client on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/349498 (https://phabricator.wikimedia.org/T163565) [18:21:32] (03PS2) 1020after4: Include ::profile::conftool::client on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/349498 (https://phabricator.wikimedia.org/T163565) [18:25:57] (03CR) 10Paladox: [C: 031] Add 90.231.10.86 to phabbanlist, this crawler is causing outages. [puppet] - 10https://gerrit.wikimedia.org/r/349342 (owner: 1020after4) [18:27:01] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:31:01] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] [18:37:52] (03PS9) 10Catrope: Enable RCFilters beta feature on all wikis except wikidatawiki, nlwiki, cswiki, etwiki and hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/343439 (https://phabricator.wikimedia.org/T144458) [18:38:46] (03CR) 10EddieGP: [C: 031] Add 90.231.10.86 to phabbanlist, this crawler is causing outages. [puppet] - 10https://gerrit.wikimedia.org/r/349342 (owner: 1020after4) [18:41:31] RECOVERY - Check Varnish expiry mailbox lag on cp2011 is OK: OK: expiry mailbox lag is 60775 [18:44:59] things are getting better; it could be related to the higher number of requests we are getting [19:01:34] anybody mind reviewing this? It's cherry-picked to the integration boxes and seems to be working! 
https://gerrit.wikimedia.org/r/349343
[19:02:03] (PS6) Ejegg: Ensure mcrypt enabled on integration slaves [puppet] - https://gerrit.wikimedia.org/r/349343
[19:02:23] rebased and ffwd-able ^^^
[19:08:01] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[19:10:01] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[19:17:17] (CR) Jcrespo: [C: 2] Ensure mcrypt enabled on integration slaves [puppet] - https://gerrit.wikimedia.org/r/349343 (owner: Ejegg)
[19:17:51] thanks Jcrespo!
[19:18:28] for one thing he does, he shouldn't take much credit
[19:30:01] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[19:34:27] (PS1) Andrew Bogott: Designate: Allow labs clients to access the designate API. [puppet] - https://gerrit.wikimedia.org/r/349531 (https://phabricator.wikimedia.org/T45580)
[19:49:45] (PS7) Jcrespo: Kill long running queries with stricter conditions [software] - https://gerrit.wikimedia.org/r/346559 (https://phabricator.wikimedia.org/T160984)
[19:54:54] Operations, DBA, Wikimedia-General-or-Unknown: Spurious completely empty `image` table row on commonswiki - https://phabricator.wikimedia.org/T155769#3202432 (jcrespo) > (I volunteer) Please don't.
[19:56:04] (CR) Catrope: [C: 2] Force Labs to eqiad, since all the services are there.
[mediawiki-config] - https://gerrit.wikimedia.org/r/349349 (https://phabricator.wikimedia.org/T163514) (owner: Mattflaschen)
[19:59:01] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[20:06:47] (PS6) Zppix: Fixes EducationProgram user rights so that they can be assigned/removed by sysops & Bureaucrats [mediawiki-config] - https://gerrit.wikimedia.org/r/349427 (https://phabricator.wikimedia.org/T163167)
[20:07:44] Reedy: ^
[20:09:12] (PS7) Zppix: Fixes EducationProgram user rights so that they can be assigned/removed by sysops & Bureaucrats [mediawiki-config] - https://gerrit.wikimedia.org/r/349427 (https://phabricator.wikimedia.org/T163167)
[20:12:39] (Merged) jenkins-bot: Force Labs to eqiad, since all the services are there. [mediawiki-config] - https://gerrit.wikimedia.org/r/349349 (https://phabricator.wikimedia.org/T163514) (owner: Mattflaschen)
[20:12:48] (CR) jenkins-bot: Force Labs to eqiad, since all the services are there. [mediawiki-config] - https://gerrit.wikimedia.org/r/349349 (https://phabricator.wikimedia.org/T163514) (owner: Mattflaschen)
[20:15:01] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[20:16:11] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[20:18:01] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
[20:20:03] wonder what the problem is ^^?
[20:32:36] jynus another user reported thumbnail issues in wikipedia-en.
[20:47:28] paladox: i think it happened when the dc switch occurred, idk if it's related though
[20:47:45] Zppix probably unrelated as it happened before with eqiad.
[20:47:57] though i thought it was fixed.
[20:48:19] paladox: i don't know too much, i just recall something happening with 5xx a few days ago
[20:48:30] ok
[20:52:01] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[21:07:01] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[21:09:24] (PS1) Alexandros Kosiaris: icinga: Adjust the frequency of sync-icinga-state [puppet] - https://gerrit.wikimedia.org/r/349603 (https://phabricator.wikimedia.org/T163286)
[21:10:36] (CR) Alexandros Kosiaris: [V: 2 C: 2] icinga: Adjust the frequency of sync-icinga-state [puppet] - https://gerrit.wikimedia.org/r/349603 (https://phabricator.wikimedia.org/T163286) (owner: Alexandros Kosiaris)
[21:14:11] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[21:19:33] I want to file a bug because it seems static assets of an extension (only one) cannot be reached externally (unlike https://en.wikipedia.org/w/extensions/Echo/modules/controller/mw.echo.Controller.js), but before doing so, is there a configuration somewhere so I can check? The extension is deployed and accessible (and assets are accessible internally as well)
[21:33:34] nvm, found out that the extension is actually bundled with other ones (that's super weird)
[21:56:11] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[22:06:59] I've been getting
[22:07:01] Request from 173.161.178.89 via cp2026 cp2026, Varnish XID 346270525
[22:07:02] Error: 503, Service Unavailable at Fri, 21 Apr 2017 22:04:24 GMT
[22:07:10] and other images on Commons full-resolution images.
[22:07:40] Is that thumbnails?
[22:08:01] other users have reported that too, though not the error, only saying thumbnails failed to load.
[22:10:01] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[22:16:47] Having that too, opening random images in full size view on commons results in (1) loading them as expected (2) firefox saying "can't show image because it contains errors" (3) getting 503 Wikimedia error page shown (pick one from 1-3 randomly with same probability each).
[22:17:51] and me too
[22:17:58] https://commons.wikimedia.org/wiki/File:Copenhagen_(33071391560).jpg
[22:18:08] that works now
[22:18:19] (PS1) Eevans: WIP: Create a Cassandra 3.7 configuration [puppet] - https://gerrit.wikimedia.org/r/349668 (https://phabricator.wikimedia.org/T160570)
[22:18:20] Possibly related to T147992
[22:18:21] T147992: Large image https://commons.wikimedia.org/wiki/File:Map_of_Hindoostan,_1788,_by_Rennell.jpg can't create thumbs - https://phabricator.wikimedia.org/T147992
[22:18:21] Well it's not specific files
[22:18:35] since the files started working soon after.
[22:18:42] I can load the very same image with F5 over and over, getting different responses
[22:18:48] (CR) Eevans: [C: -1] "Not even close..." [puppet] - https://gerrit.wikimedia.org/r/349668 (https://phabricator.wikimedia.org/T160570) (owner: Eevans)
[22:18:50] oh.
[22:20:24] i wonder if it doesn't have to do with upload http 5xx error that icinga reported at 4:07 pm (UTC-5)
[22:21:06] (PS2) Eevans: WIP: Create a Cassandra 3.7 configuration [puppet] - https://gerrit.wikimedia.org/r/349668 (https://phabricator.wikimedia.org/T160570)
[22:21:26] i'll keep refreshing and will note when/if it happens again
[22:21:35] Operations, Commons, Multimedia: thumbnails on commons are not showing correctly - https://phabricator.wikimedia.org/T163610#3202971 (Paladox)
[22:21:46] eddiegp i've filed a new task ^^
[22:22:22] Operations, Commons, Multimedia: thumbnails on commons are not showing correctly - https://phabricator.wikimedia.org/T163610#3202971 (Zppix) Possibly due to Upload 5xx errors? Icinga keeps reporting that along with esams and codfw 5xx errors.
[22:25:11] paladox: Thanks. Are you sure it's right to call this a thumbnail issue? As it seems to happen with full-resolution images (e.g. what you linked) too
[22:25:43] Oh. Um, i thought that was a thumbnail.
But yes i should rename it to thumbnails and fullscreen
[22:26:54] Yeah, thanks
[22:27:02] you're welcome
[22:42:11] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0]
[22:44:24] looks like a spam of patches is all
[22:44:37] by antoine, of course :)
[22:44:50] https://gerrit.wikimedia.org/r/#/q/topic:T119973+%28status:open+OR+status:merged%29
[22:44:51] T119973: Convert all repos to use npm Jenkins job with jsonlint and eslint - https://phabricator.wikimedia.org/T119973
[22:46:11] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 78.57% of data above the critical threshold [140.0]
[22:46:51] PROBLEM - nova-compute process on labvirt1003 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute
[22:47:51] RECOVERY - nova-compute process on labvirt1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[22:48:02] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[22:49:58] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp2026.codfw.wmnet,service=varnish-be
[22:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:03:33] Operations: Production Shell access denied - https://phabricator.wikimedia.org/T163568#3203078 (Capt_Swing)
[23:04:34] Operations: Production Shell access denied - https://phabricator.wikimedia.org/T163568#3202018 (Capt_Swing)
[23:13:01] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:41:51] RECOVERY - Check Varnish expiry mailbox lag on cp2026 is OK: OK: expiry mailbox lag is 0
[23:47:11] RECOVERY - Work requests waiting in Zuul Gearman
server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0]
[23:52:07] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp2026.codfw.wmnet,service=varnish-be
[23:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log