[00:05:38] Operations, Patch-For-Review, Release-Engineering-Team (Watching / External): rename naos to deploy2001 and reinstall with stretch - https://phabricator.wikimedia.org/T193916#4212723 (Dzahn)
[00:06:41] Operations, Patch-For-Review, Release-Engineering-Team (Watching / External): rename naos to deploy2001 and reinstall with stretch - https://phabricator.wikimedia.org/T193916#4183490 (Dzahn) >>! In T193916#4212246, @RobH wrote: > Please ensure when the rename is done, a sub-task for the on-site (@pap...
[00:09:13] PROBLEM - ensure kvm processes are running on labvirt1019 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64
[00:09:43] PROBLEM - ensure kvm processes are running on labvirt1020 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64
[00:15:32] ACKNOWLEDGEMENT - DNS naos.mgmt on naos.mgmt is CRITICAL: Domain naos.mgmt.codfw.wmnet was not found by the server daniel_zahn T193916
[00:17:23] RECOVERY - ensure kvm processes are running on labvirt1020 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64
[00:18:13] RECOVERY - ensure kvm processes are running on labvirt1019 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64
[00:19:55] !log rdb2004 - down in Icinga since >1d, nothing on console, don't see a SAL entry. powercycling
[00:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:51] RECOVERY - Host rdb2004 is UP: PING OK - Packet loss = 0%, RTA = 36.15 ms
[00:24:51] PROBLEM - Check health of redis instance on 6381 on rdb2004 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6381
[00:27:21] RECOVERY - Check health of redis instance on 6381 on rdb2004 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 4604236 keys, up 5 minutes 47 seconds
[00:38:04] jouncebot: next
[00:38:04] In 82 hour(s) and 21 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180521T1100)
[00:45:05] (PS1) Dzahn: xenon: replace base::service_unit, rm upstart template [puppet] - https://gerrit.wikimedia.org/r/433684 (https://phabricator.wikimedia.org/T194724)
[00:46:56] (PS1) Dzahn: dumps: replace base::service_unit with systemd::service [puppet] - https://gerrit.wikimedia.org/r/433685 (https://phabricator.wikimedia.org/T194724)
[00:49:51] (PS1) Dzahn: ircecho: replace base::service_unit with systemd::service [puppet] - https://gerrit.wikimedia.org/r/433686 (https://phabricator.wikimedia.org/T194724)
[01:27:00] !log syncing wmf.4 again to deploy https://gerrit.wikimedia.org/r/#/c/433673/ refs T194900 T191050
[01:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:27:05] T194900: Undefined variable: nonce in ResourceLoaderClientHtml.php - https://phabricator.wikimedia.org/T194900
[01:27:05] T191050: 1.32.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T191050
[01:34:31] !log twentyafterfour@tin Synchronized php-1.32.0-wmf.4: sync wmf.4 to deploy https://gerrit.wikimedia.org/r/#/c/433673/ (duration: 09m 54s)
[01:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:50:32] Operations, ops-eqiad: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T194907#4212678 (Dzahn) duplicate of T194851 the event-handler created the ticket twice. that in itself might deserve another ticket.
[03:28:12] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 854.58 seconds
[03:46:21] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[03:51:22] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[04:02:32] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 159.60 seconds
[04:18:22] PROBLEM - Router interfaces on cr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.129 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[04:19:41] RECOVERY - Router interfaces on cr1-eqsin is OK: OK: host 103.102.166.129, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0
[04:52:52] PROBLEM - Router interfaces on cr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.129 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2
[04:54:11] RECOVERY - Router interfaces on cr1-eqsin is OK: OK: host 103.102.166.129, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0
[05:17:08] Operations, ops-codfw, DBA: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194103#4212803 (Marostegui) This time it worked ``` logicaldrive 1 (3.3 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) physicaldrive 1I:1:2 (port 1I:box 1:bay 2,...
[05:18:06] Operations, ops-eqiad, DBA: Possibly BBU issues on db1067 - https://phabricator.wikimedia.org/T194852#4212805 (Marostegui) Looks like it was a one time thing: ``` root@db1067:~# megacli -AdpBbuCmd -a0 | grep Temper Temperature: 47 C Temperature : OK ``` I am going to...
[05:18:16] !log Stop MySQL and reboot db1067 - T194852
[05:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:20] T194852: Possibly BBU issues on db1067 - https://phabricator.wikimedia.org/T194852
[05:20:20] (PS1) Marostegui: db-eqiad.php: Repool db1066 [mediawiki-config] - https://gerrit.wikimedia.org/r/433689 (https://phabricator.wikimedia.org/T193847)
[05:22:02] (PS2) Marostegui: Revert "wiki replicas: depool labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/433609 (owner: Bstorm)
[05:22:15] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1066 [mediawiki-config] - https://gerrit.wikimedia.org/r/433689 (https://phabricator.wikimedia.org/T193847) (owner: Marostegui)
[05:22:54] (CR) Marostegui: [C: 2] Revert "wiki replicas: depool labsdb1011" [puppet] - https://gerrit.wikimedia.org/r/433609 (owner: Bstorm)
[05:23:29] (Merged) jenkins-bot: db-eqiad.php: Repool db1066 [mediawiki-config] - https://gerrit.wikimedia.org/r/433689 (https://phabricator.wikimedia.org/T193847) (owner: Marostegui)
[05:23:45] (CR) jenkins-bot: db-eqiad.php: Repool db1066 [mediawiki-config] - https://gerrit.wikimedia.org/r/433689 (https://phabricator.wikimedia.org/T193847) (owner: Marostegui)
[05:24:59] !log Reload haproxy on dbproxy1010 to repool labsdb1011
[05:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:38] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1066 - T193847 (duration: 01m 22s)
[05:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:42] T193847: Move db1066 to row A - https://phabricator.wikimedia.org/T193847
[05:26:43] Operations, ops-eqiad, DBA: Possibly BBU issues on db1067 - https://phabricator.wikimedia.org/T194852#4212814 (Marostegui) After reboot: ``` root@db1067:~# megacli -AdpBbuCmd -a0 | grep Temper Temperature: 48 C Temperature : OK ```
[05:27:09] Operations, ops-eqiad, DBA, Patch-For-Review: Move db1066 to row A - https://phabricator.wikimedia.org/T193847#4212815 (Marostegui) Open>Resolved Server repooled Thanks Chris for getting this done!
[05:33:24] !log Deploy schema change on dbstore1002:s3 - T191519 T188299 T190148
[05:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:30] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519
[05:33:30] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148
[05:33:30] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299
[05:34:51] PROBLEM - NTP peers on dns5001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:36:01] RECOVERY - NTP peers on dns5001 is OK: NTP OK: Offset 0.001684 secs
[05:37:48] !log Stop MySQL on db1120 to copy its content to db2075 - T190704
[05:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:37:52] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704
[05:43:25] Operations, ops-codfw, DBA: Degraded RAID on db2067 - https://phabricator.wikimedia.org/T194103#4212838 (Marostegui) Open>Resolved
[05:45:22] PROBLEM - NTP peers on dns5002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:46:31] RECOVERY - NTP peers on dns5002 is OK: NTP OK: Offset 0.001592 secs
[05:47:22] (PS2) Marostegui: db1120: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/433331
[05:49:08] (PS1) Marostegui: db1067.yaml: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/433692 (https://phabricator.wikimedia.org/T194852)
[05:49:41] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) timed out before a response was received
[05:50:13] (CR) Marostegui: [C: 2] db1067.yaml: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/433692 (https://phabricator.wikimedia.org/T194852) (owner: Marostegui)
[05:50:51] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy
[05:56:04] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.2 with snmp version 2
[05:57:03] PROBLEM - NTP peers on dns5001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:57:04] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0
[05:57:53] RECOVERY - NTP peers on dns5001 is OK: NTP OK: Offset 0.001678 secs
[06:01:23] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[06:02:23] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0
[06:02:30] Any maintenance ongoing or something? XioNoX ^
[06:03:01] marostegui: high packet loss on our eqsin-codfw link, I'm in the middle of an email to their support
[06:03:24] \o/
[06:03:25] thanks :)
[06:04:40] https://smokeping.wikimedia.org/smokeping.cgi?target=eqsin.Hosts.bast5001
[06:04:45] smokeping is ugly
[06:04:59] will change links metrics in a few to fail traffic over the other link
[06:06:14] PROBLEM - Router interfaces on cr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.129 for 1.3.6.1.2.1.2.2.1.2 with snmp version 2
[06:08:43] RECOVERY - Router interfaces on cr1-eqsin is OK: OK: host 103.102.166.129, interfaces up: 75, down: 0, dormant: 0, excluded: 0, unused: 0
[06:14:51] !log bumping eqsin-codfw link OSPF metric to 5000 (due to packet loss on link)
[06:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:00] (PS1) Jcrespo: mariadb: Repool db1105:s6 (checked) and pool all vslow hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/433697
[08:24:01] (CR) Marostegui: [C: 1] mariadb: Repool db1105:s6 (checked) and pool all vslow hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/433697 (owner: Jcrespo)
[08:28:21] (CR) Jcrespo: [C: 2] mariadb: Repool db1105:s6 (checked) and pool all vslow hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/433697 (owner: Jcrespo)
[08:28:43] are we alone?
[08:29:12] looks so :)
[08:29:37] time to bring down wikipedia, I guess?
[08:29:47] (Merged) jenkins-bot: mariadb: Repool db1105:s6 (checked) and pool all vslow hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/433697 (owner: Jcrespo)
[08:30:03] (CR) jenkins-bot: mariadb: Repool db1105:s6 (checked) and pool all vslow hosts [mediawiki-config] - https://gerrit.wikimedia.org/r/433697 (owner: Jcrespo)
[08:32:17] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1105:s6 and other vslow hosts (duration: 01m 21s)
[08:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:58] (PS1) Jcrespo: mariadb: Depool db1082 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/433700
[08:44:03] (PS2) Jcrespo: mariadb: Depool db1085 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/433700
[08:55:13] PROBLEM - Disk space on maps1002 is CRITICAL: DISK CRITICAL - free space: /srv 54756 MB (3% inode=99%)
[08:56:46] (CR) Jcrespo: [C: 2] mariadb: Depool db1085 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/433700 (owner: Jcrespo)
[08:58:26] (Merged) jenkins-bot: mariadb: Depool db1085 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/433700 (owner: Jcrespo)
[08:59:26] (CR) jenkins-bot: mariadb: Depool db1085 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/433700 (owner: Jcrespo)
[08:59:34] PROBLEM - Disk space on maps1003 is CRITICAL: DISK CRITICAL - free space: /srv 53718 MB (3% inode=99%)
[09:01:06] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1085 (duration: 01m 20s)
[09:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:58] ^gehel issues with map servers?
[09:02:19] jynus: I'm on it
[09:02:43] should go back to normal in a bit, let me ack that alert
[09:03:14] ok, no problem, just got worried for a second
[09:03:38] ACKNOWLEDGEMENT - Disk space on maps1002 is CRITICAL: DISK CRITICAL - free space: /srv 44775 MB (3% inode=99%): Gehel reducing replication factor of cassandra v3 keyspace
[09:03:38] ACKNOWLEDGEMENT - Disk space on maps1003 is CRITICAL: DISK CRITICAL - free space: /srv 47987 MB (3% inode=99%): Gehel reducing replication factor of cassandra v3 keyspace
[09:04:00] that's the aftermath of adding i18n, we still have a duplicated keyspace...
[09:04:05] so more disk usage than usual
[09:04:38] still, isn't 1.5 TB a bit low for those servers?
[09:05:40] jynus: it was fine until recently... I suspect there is also some cleanup we could do on the postgres side...
[09:06:30] I don't think we buy data servers with less than 4TB these days
[09:06:48] different world :)
[09:07:06] the dataset for maps is pretty much bounded...
[09:07:19] but isn't it osm?
[09:07:28] so growing constantly?
[09:07:46] to some extent yes
[09:07:47] or am I confusing things- I don't want to bother you precisely now
[09:08:43] "plain OSM XML variant takes over 913 GB"
[09:09:13] 913GB? that seems a bit much
[09:09:23] compared to the disk usage I have seen so far
[09:09:25] (PS1) Mark Bergsma: Extend unit testing of RunCommand [debs/pybal] - https://gerrit.wikimedia.org/r/433702
[09:09:37] well, that is xml uncompressed
[09:09:54] I expect binary form to be much less + then overhead, etc
[09:10:59] your postgres usage seems to be 644G, which is not that far off (in the same scale)
[09:12:09] jynus: yep, but it has been growing faster than I was expecting, I suspect we have something not being cleaned up as it should
[09:18:54] (PS3) Addshore: Wikidata dispatch, remove cron params, use values from mediawiki-config [puppet] - https://gerrit.wikimedia.org/r/430923
[09:19:33] (PS3) Addshore: Wikidata dispatch, set defaults for dispatchChanges settings [mediawiki-config] - https://gerrit.wikimedia.org/r/430921
[09:19:47] (PS3) Addshore: Wikidata dispatch, disable dispatching for testwikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/430924
[09:20:07] (PS7) Addshore: Wikidata dispatch, Use a LockManager with short TTL for testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/395967 (https://phabricator.wikimedia.org/T178652)
[09:20:22] (PS3) Addshore: Revert "Wikidata dispatch, disable dispatching for testwikidatawiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/430925
[09:21:16] _joe_: greg-g ^^ thats the "evil" chain of stuff :D
[09:24:04] (PS7) Mark Bergsma: Add unit tests for ProxyFetchMonitoringProtocol [debs/pybal] - https://gerrit.wikimedia.org/r/430339
[09:24:06] (PS2) Mark Bergsma: Avoid Deferred.cancel() induced CancelledErrors [debs/pybal] - https://gerrit.wikimedia.org/r/433364
[09:24:08] (PS3) Mark Bergsma: Handle HTTP status 302 and 303 as well as 301 [debs/pybal] - https://gerrit.wikimedia.org/r/430393 (https://phabricator.wikimedia.org/T102393)
[09:24:10] (PS3) Mark Bergsma: Add full unit test coverage of IdleConnection [debs/pybal] - https://gerrit.wikimedia.org/r/433341
[09:24:12] (PS2) Mark Bergsma: Cleanup monitor shutdown handler (invoking stop) after run [debs/pybal] - https://gerrit.wikimedia.org/r/433369
[09:24:14] (PS2) Mark Bergsma: Split monitor tests into separate modules [debs/pybal] - https://gerrit.wikimedia.org/r/433370
[09:24:16] (PS2) Mark Bergsma: Extend unit testing of RunCommand [debs/pybal] - https://gerrit.wikimedia.org/r/433702
[09:24:41] !log drop v3 keyspace on cassandra maps (unused since migration to i18n)
[09:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:37] !log stop and reimage db1085
[09:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:40] (CR) Mark Bergsma: [C: 1] Add unit tests for ProxyFetchMonitoringProtocol [debs/pybal] - https://gerrit.wikimedia.org/r/430339 (owner: Mark Bergsma)
[09:28:09] (CR) Mark Bergsma: [C: 1] Avoid Deferred.cancel() induced CancelledErrors [debs/pybal] - https://gerrit.wikimedia.org/r/433364 (owner: Mark Bergsma)
[09:29:03] ACKNOWLEDGEMENT - MariaDB Slave IO: s6 on db1102 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1085.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1085.eqiad.wmnet (111 Connection refused) Jcrespo reimage db1085
[09:29:07] (CR) Mark Bergsma: [C: 1] Handle HTTP status 302 and 303 as well as 301 [debs/pybal] - https://gerrit.wikimedia.org/r/430393 (https://phabricator.wikimedia.org/T102393) (owner: Mark Bergsma)
[09:32:07] (CR) Mark Bergsma: [C: 1] Add full unit test coverage of IdleConnection [debs/pybal] - https://gerrit.wikimedia.org/r/433341 (owner: Mark Bergsma)
[09:33:14] RECOVERY - Disk space on maps1002 is OK: DISK OK
[09:33:59] !log cleared v3 snapshot on maps servers
[09:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:03] RECOVERY - Disk space on maps1003 is OK: DISK OK
[09:34:21] jynus: and disk is now OK again...
[09:34:28] * gehel is still learning about cassandra...
[09:35:08] ACKNOWLEDGEMENT - MariaDB Slave Lag: s6 on db1102 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 467.80 seconds Jcrespo reimage db1085
[09:35:32] (CR) Mark Bergsma: [C: 1] Split monitor tests into separate modules [debs/pybal] - https://gerrit.wikimedia.org/r/433370 (owner: Mark Bergsma)
[09:35:33] PROBLEM - MegaRAID on db1054 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
[09:37:48] ACKNOWLEDGEMENT - MegaRAID on db1054 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough Marostegui T194867
[09:38:20] (CR) Mark Bergsma: [C: 1] Extend unit testing of RunCommand [debs/pybal] - https://gerrit.wikimedia.org/r/433702 (owner: Mark Bergsma)
[09:39:32] (PS3) Marostegui: db1120: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/433331
[09:40:17] (CR) Marostegui: [C: 2] db1120: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/433331 (owner: Marostegui)
[09:42:44] Operations, ops-eqiad: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T194907#4212678 (Volans) @Dzahn That usually happens if the alarm flap on icinga for some reason, the handler open a new task for each CRITICAL/HARD triggered by Icinga. I'll check with the Cloud team though because...
[09:42:54] (PS1) Jcrespo: Revert "mariadb: Depool db1085 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/433703
[09:43:03] Operations, ops-eqiad: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T194851#4213070 (Volans)
[09:43:06] Operations, ops-eqiad: Degraded RAID on labvirt1019 - https://phabricator.wikimedia.org/T194907#4213072 (Volans)
[09:43:45] !log Updated the Wikidata property suggester with data from Monday's JSON dump and applied the T132839 workarounds
[09:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:51] T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839
[09:44:12] (PS1) Marostegui: db2075.yaml: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/433704
[09:44:55] !log Stop MySQL on db2092 for testing
[09:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:29] (CR) Marostegui: [C: 2] db2075.yaml: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/433704 (owner: Marostegui)
[09:47:30] !log Stop MySQL on db1116 for testing
[09:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:40] (Abandoned) Marostegui: misc.my.cnf.erb: Enable barracuda and innodb_strict_mode [puppet] - https://gerrit.wikimedia.org/r/321638 (https://phabricator.wikimedia.org/T150949) (owner: Marostegui)
[09:53:17] Operations, WMDE-QWERTY-Team, wikidiff2, Patch-For-Review: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4213084 (WMDE-Fisch) >>! In T190717#4210310, @MoritzMuehlenhoff wrote: >>>! In T190717#4210047, @Lea_WMDE wrote: >> Hi @MoritzMuehlenhoff,...
[09:59:33] (PS1) Jcrespo: mariadb: Repool db1085 with low load after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/433705
[10:00:47] (PS2) Jcrespo: mariadb: Repool db1085 with low load after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/433705
[10:05:11] (CR) Jcrespo: [C: 2] mariadb: Repool db1085 with low load after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/433705 (owner: Jcrespo)
[10:06:40] (Merged) jenkins-bot: mariadb: Repool db1085 with low load after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/433705 (owner: Jcrespo)
[10:09:15] (CR) jenkins-bot: mariadb: Repool db1085 with low load after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/433705 (owner: Jcrespo)
[10:15:46] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1085 with low load (duration: 01m 20s)
[10:15:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:46] RECOVERY - MegaRAID on db1054 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[12:12:53] Operations, WMDE-QWERTY-Team, wikidiff2, Patch-For-Review: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4213402 (Lea_WMDE) >>! In T190717#4213084, @WMDE-Fisch wrote: >>>! In T190717#4210310, @MoritzMuehlenhoff wrote: >>> - For the rollout (of...
[12:26:10] (CR) Alexandros Kosiaris: [C: 1] network/tcpircbot/kubernetes: add deploy2001 to allowed hosts [puppet] - https://gerrit.wikimedia.org/r/433637 (https://phabricator.wikimedia.org/T193916) (owner: Dzahn)
[12:26:35] !log stop and reimage db2041
[12:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:03] (PS1) Imarlier: webperf: separate permissions from specific apps [puppet] - https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837)
[12:29:43] (CR) jerkins-bot: [V: -1] webperf: separate permissions from specific apps [puppet] - https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837) (owner: Imarlier)
[12:32:33] (PS2) Imarlier: webperf: separate permissions from specific apps [puppet] - https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837)
[12:33:07] (CR) jerkins-bot: [V: -1] webperf: separate permissions from specific apps [puppet] - https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837) (owner: Imarlier)
[12:35:48] db1054 is flapping- as there seems not to be high impact so far, I am going to downtime it until monday morning CC marostegui
[12:35:55] Operations, ops-eqiad, netops, Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#3991415 (ayounsi)
[12:39:26] PROBLEM - MegaRAID on db1054 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
[12:46:12] bblack, about?
[12:52:17] (PS3) Imarlier: webperf: separate permissions from specific apps [puppet] - https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837)
[12:53:01] (CR) jerkins-bot: [V: -1] webperf: separate permissions from specific apps [puppet] - https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837) (owner: Imarlier)
[12:54:50] (PS4) Imarlier: webperf: separate permissions from specific apps [puppet] - https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837)
[12:55:25] (CR) jerkins-bot: [V: -1] webperf: separate permissions from specific apps [puppet] - https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837) (owner: Imarlier)
[12:57:10] Krenair: yes
[12:57:41] bblack, want to talk about secure redirect service?
[12:58:47] (PS5) Imarlier: webperf: separate permissions from specific apps [puppet] - https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837)
[12:59:06] Krenair: I suppose that's a good topic for here :) Where are you at?
[12:59:09] /wmy irc bounce traffic is being super laggy
[12:59:21] (CR) jerkins-bot: [V: -1] webperf: separate permissions from specific apps [puppet] - https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837) (owner: Imarlier)
[12:59:21] you know along the corridor from reception?
[12:59:33] there's benches on the right hand side with power, I'm on the second one
[12:59:53] ok
[13:01:26] (PS6) Imarlier: webperf: separate permissions from specific apps [puppet] - https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837)
[13:02:12] (CR) jerkins-bot: [V: -1] webperf: separate permissions from specific apps [puppet] - https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837) (owner: Imarlier)
[13:09:08] (PS7) Imarlier: webperf: separate permissions from specific apps [puppet] - https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837)
[13:09:47] (CR) jerkins-bot: [V: -1] webperf: separate permissions from specific apps [puppet] - https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837) (owner: Imarlier)
[13:10:01] RECOVERY - MegaRAID on db1054 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[13:15:40] (CR) Imarlier: "@dzahn Could use your advice on this. All I'm trying to do is:" [puppet] - https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837) (owner: Imarlier)
[13:19:16] !log reset 2FA for Trizek_(WMF)
[13:19:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:30] Operations, Traffic: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4213622 (Vgutierrez) @MaxBioHazard so.. this is one of your bots talking proper TLS 1.2: ```- ReqHeader X-Connection-Properties: H2=0; SSR=0; SSL=TLSv1.2; C=ECDHE-ECDSA-...
[13:43:01] (PS2) Jcrespo: Revert "mariadb: Depool db1085 for maintenance" [mediawiki-config] - https://gerrit.wikimedia.org/r/433703
[13:46:21] PROBLEM - Device not healthy -SMART- on db1066 is CRITICAL: cluster=mysql device=megaraid,6 instance=db1066:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1066&var-datasource=eqiad%2520prometheus%252Fops
[13:50:56] Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4213769 (Marostegui) We are all set for doing the copies to the new hardware once it arrives. eqiad: db1116: s1, s3, s5, s8...
[13:59:56] !log Manually fail disk #6 on db1066 - T194870
[14:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:01] T194870: Failover s2 primary master - https://phabricator.wikimedia.org/T194870
[14:09:49] :-)
[14:10:15] I will apply the partitioning to db1105 when it catches up replication
[14:13:32] Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4213808 (jcrespo) One thing we could fix at the same time was the configuration of the triggers to write to the binlog- I am...
[14:25:41] PROBLEM - MegaRAID on db1066 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[14:25:42] ACKNOWLEDGEMENT - MegaRAID on db1066 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T194955
[14:25:47] Operations, ops-eqiad: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T194955#4213873 (ops-monitoring-bot)
[14:29:06] Operations, ops-eqiad, DBA: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T194955#4213884 (Marostegui) p:Triage>Normal a:Cmjohnson Already talked to @Cmjohnson - he will replace it today. I manually failed it.
[14:39:59] Operations, Discovery, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: Investigate and improve memory allocation rates of WDQS - https://phabricator.wikimedia.org/T181988#4213918 (Gehel) Investigation on T192759 lead to some interesting discoveries. Blazegraph Jour...
[15:10:10] PROBLEM - Disk space on maps1002 is CRITICAL: DISK CRITICAL - free space: /srv 55197 MB (3% inode=99%)
[15:17:01] PROBLEM - keystone public endoint port 5000 on labtestcontrol2001 is CRITICAL: connect to address 208.80.153.47 and port 5000: Connection refused
[15:17:40] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2001 is CRITICAL: connect to address 208.80.153.47 and port 35357: Connection refused
[15:17:41] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: connect to address 208.80.153.75 and port 35357: Connection refused
[15:17:57] gehel: FYI maps1002 ^^^
[15:18:10] PROBLEM - keystone public endoint port 5000 on labtestcontrol2003 is CRITICAL: connect to address 208.80.153.75 and port 5000: Connection refused
[15:18:15] volans: thanks! looking
[15:20:25] damn, I did some major cleanup this morning, but it looks like something is going on...
[15:23:17] !log clear cassandra snapshots on maps1002 [15:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:10] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 0.079 second response time [15:24:30] RECOVERY - keystone public endoint port 5000 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 757 bytes in 0.081 second response time [15:24:44] Hello, anybody with deploy privileges around? [15:25:17] Yup [15:25:23] I'm about to deploy something VE [15:26:09] Reedy, good. I'm going to upload a throttle exception. [15:30:20] RECOVERY - Disk space on maps1002 is OK: DISK OK [15:31:40] !log rolling restart of cassandra on maps1* (repair was started on each node, instead of sequentially) [15:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:10] PROBLEM - Disk space on furud is CRITICAL: DISK CRITICAL - free space: /mnt/1a 1308891 MB (3% inode=96%): /mnt/2a 1248862 MB (3% inode=96%) [15:38:38] !log reedy@tin Synchronized php-1.32.0-wmf.3/extensions/VisualEditor/: Fix dialog (duration: 01m 25s) [15:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:21] RECOVERY - Memory correctable errors -EDAC- on cp1068 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=cp1068&var-datasource=eqiad%2520prometheus%252Fops [15:40:40] Urbanecm: patch ready? [15:40:57] (03PS1) 10Urbanecm: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433733 (https://phabricator.wikimedia.org/T194888) [15:41:00] Reedy, now [15:41:05] See above [15:41:29] Sorry, I must tunnel all the services needed (gerrit, IRC) over SSH because hackathon venue blocks required ports [15:41:48] Use the web ui? 
;0 [15:42:04] (03CR) 10Reedy: [C: 032] New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433733 (https://phabricator.wikimedia.org/T194888) (owner: 10Urbanecm) [15:42:15] Reedy, SSH tunnel is one way, web UI another one .D [15:42:17] :D [15:43:40] (03Merged) 10jenkins-bot: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433733 (https://phabricator.wikimedia.org/T194888) (owner: 10Urbanecm) [15:43:55] (03CR) 10jenkins-bot: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433733 (https://phabricator.wikimedia.org/T194888) (owner: 10Urbanecm) [15:44:43] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Possibly BBU issues on db1067 - https://phabricator.wikimedia.org/T194852#4214123 (10Marostegui) Still looking good after 10 hours: ``` root@db1067:~# megacli -AdpBbuCmd -a0 | grep Temper Temperature: 47 C Temperature : O... [15:44:45] !log Stop MySQL and reboot db1067 - T194852 [15:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:50] T194852: Possibly BBU issues on db1067 - https://phabricator.wikimedia.org/T194852 [15:45:27] !log reedy@tin Synchronized wmf-config/throttle.php: throttling!
(duration: 01m 22s) [15:45:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:33] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4214145 (10Marostegui) [15:49:55] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4081506 (10Marostegui) [15:49:59] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install db112[45].eqiad.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194780#4208342 (10Marostegui) [15:50:03] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4081506 (10Marostegui) [15:50:06] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db209[45].codfw.wmnet (sanitarium expansion) - https://phabricator.wikimedia.org/T194781#4208357 (10Marostegui) [15:52:11] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Possibly BBU issues on db1067 - https://phabricator.wikimedia.org/T194852#4214156 (10Marostegui) For the record, after the reboot: ``` root@db1067:~# megacli -AdpBbuCmd -a0 | grep Temper Temperature: 48 C Temperature ``` [15:55:02] (03PS1) 10Rush: openstack: labtest use labtestcontrol2003 for keystone [puppet] - 10https://gerrit.wikimedia.org/r/433734 (https://phabricator.wikimedia.org/T167559) [15:55:13] 10Operations, 10Traffic: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962#4214163 (10BBlack) p:05Triage>03Normal [15:56:12] 10Operations, 10Traffic: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962#4214178 (10BBlack) [15:56:22] 10Operations, 10Traffic, 10HTTPS, 10Patch-For-Review: Create a 
secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#4214177 (10BBlack) [15:57:21] (03CR) 10Rush: "labtestcontrol2001.wikimedia.org,labtestcontrol2003.wikimedia.org,labtestvirt2001.codfw.wmnet,labtestvirt2003.codfw.wmnet,labtestservices2" [puppet] - 10https://gerrit.wikimedia.org/r/433734 (https://phabricator.wikimedia.org/T167559) (owner: 10Rush) [15:57:32] 10Operations, 10Traffic, 10HTTPS, 10Patch-For-Review: Create a secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#2235376 (10BBlack) Note added parent above, T194962. We're going to try to develop the generic solution for LE first, and then... [15:58:58] (03PS1) 10Mark Bergsma: Add unit testing for BGP Factory classes [debs/pybal] - 10https://gerrit.wikimedia.org/r/433735 [15:59:01] !log reedy@tin Synchronized php-1.32.0-wmf.4/extensions/VisualEditor/: Fix dialog (duration: 01m 19s) [15:59:02] (03PS1) 10Mark Bergsma: Small fixes [debs/pybal] - 10https://gerrit.wikimedia.org/r/433736 [15:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:36] 10Operations, 10ops-eqiad: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4214198 (10Bstorm) [15:59:55] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Ban clients of WDQS which don't follow throttling directives for some time - https://phabricator.wikimedia.org/T194653#4214211 (10Halfak) a:03Gehel [16:00:02] 10Operations, 10Traffic: gdnsd plugin support for ACME DNS challenges - https://phabricator.wikimedia.org/T194965#4214212 (10BBlack) p:05Triage>03Normal [16:00:46] !log reduce replication of maps v4 keyspace to 3 [16:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:08] (03PS1) 10Dzahn: webperf::profiling_tools: add webperf admin groups [puppet] - 
10https://gerrit.wikimedia.org/r/433738 (https://phabricator.wikimedia.org/T194390) [16:05:05] (03CR) 10Dzahn: [C: 032] webperf::profiling_tools: add webperf admin groups [puppet] - 10https://gerrit.wikimedia.org/r/433738 (https://phabricator.wikimedia.org/T194390) (owner: 10Dzahn) [16:05:08] 10Operations, 10Discovery, 10Maps: disk usage increase on maps servers - https://phabricator.wikimedia.org/T194966#4214254 (10Gehel) [16:09:45] (03PS2) 10Dzahn: webperf::profiling_tools: add perfteam admins [puppet] - 10https://gerrit.wikimedia.org/r/433738 (https://phabricator.wikimedia.org/T194390) [16:11:52] (03PS3) 10Dzahn: webperf::profiling_tools: add perfteam admins [puppet] - 10https://gerrit.wikimedia.org/r/433738 (https://phabricator.wikimedia.org/T194390) [16:15:15] (03PS1) 10Giuseppe Lavagetto: Attempt at fixing the debianization [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/433739 [16:17:21] (03PS1) 10Smalyshev: Add to the list all wikis except for ptivate ones. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433740 (https://phabricator.wikimedia.org/T194260) [16:28:50] bblack, how do you feel about python-twisted? [16:31:07] (03Abandoned) 10Alex Monk: POC: Secure redirect service [puppet] - 10https://gerrit.wikimedia.org/r/317450 (https://phabricator.wikimedia.org/T133548) (owner: 10Alex Monk) [16:33:31] Krenair: well... if I have to do the code review... 
I'd avoid it :P [16:33:52] (03PS2) 10Giuseppe Lavagetto: Attempt at fixing the debianization [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/433739 [16:34:23] Krenair: maybe you could consider https://docs.python.org/3/library/asyncio.html [16:34:58] twisted is used by pybal, afaik [16:35:21] (03CR) 10Dzahn: [C: 032] webperf::profiling_tools: add perfteam admins [puppet] - 10https://gerrit.wikimedia.org/r/433738 (https://phabricator.wikimedia.org/T194390) (owner: 10Dzahn) [16:35:26] mutante: it is :) [16:37:39] vgutierrez, hm, okay [16:38:11] 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#4214391 (10Dzahn) @imarlier @Krinkle I forgot about the shell access part. https://gerrit.wikimedia.org/r/#/c/433738/ added the perfteam ad... [16:45:39] (03CR) 10Dzahn: "i added the admin group in https://gerrit.wikimedia.org/r/#/c/433738/" [puppet] - 10https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837) (owner: 10Imarlier) [16:46:44] 10Operations, 10DNS, 10Mail, 10Patch-For-Review: Outbound mail from Greenhouse is broken - https://phabricator.wikimedia.org/T189065#4214466 (10herron) >>! In T189065#4212720, @bbogaert wrote: > 1 pm Pacific on Monday, May 21. I'll be doing the green house changes with her then. Sounds good! Added a remi... 
[16:53:41] (03PS1) 10Ppchelko: Provide EventBus URI to change-prop profile [puppet] - 10https://gerrit.wikimedia.org/r/433745 [16:54:44] (03CR) 10Ottomata: [C: 031] Provide EventBus URI to change-prop profile [puppet] - 10https://gerrit.wikimedia.org/r/433745 (owner: 10Ppchelko) [16:58:10] (03PS1) 10Ppchelko: Expose the revision-score event publically [puppet] - 10https://gerrit.wikimedia.org/r/433746 [16:59:01] (03PS3) 10Giuseppe Lavagetto: Attempt at fixing the debianization [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/433739 [17:09:23] (03PS3) 10Dzahn: network/tcpircbot/kubernetes: add deploy2001 to allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/433637 (https://phabricator.wikimedia.org/T193916) [17:14:18] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T194955#4214576 (10Marostegui) Thanks Chris ``` root@db1066:~# megacli -PDRbld -ShowProg -PhysDrv [32:6] -aALL Rebuild Progress on Device at Enclosure 32, Slot 6 Completed 8% in 9 Minutes. ``` [17:20:19] 10Operations, 10Discovery, 10Maps: disk usage increase on maps servers - https://phabricator.wikimedia.org/T194966#4214642 (10Gehel) Cassandra might have been running into compaction issues while we had both v3 and v4 keyspaces, and not enough space to run compaction. Thought I don't see any error in cassand... 
[17:23:42] (03CR) 10Dzahn: [C: 032] network/tcpircbot/kubernetes: add deploy2001 to allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/433637 (https://phabricator.wikimedia.org/T193916) (owner: 10Dzahn) [17:24:22] jouncebot: next [17:24:22] In 65 hour(s) and 35 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180521T1100) [17:24:47] (03PS4) 10Giuseppe Lavagetto: Fix debianization for python3 [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/433739 [17:24:50] 10Operations, 10Discovery, 10Maps: unban reguyla - https://phabricator.wikimedia.org/T194966#4214700 (10StjnVMF) [17:24:53] 10Operations, 10Traffic: unban reguyla - https://phabricator.wikimedia.org/T194965#4214702 (10StjnVMF) [17:24:55] 10Operations, 10ops-eqiad: unban reguyla - https://phabricator.wikimedia.org/T194964#4214704 (10StjnVMF) [17:26:44] 10Operations, 10Traffic: gdnsd plugin support for ACME DNS challenge - https://phabricator.wikimedia.org/T194965#4214714 (10Paladox) [17:26:56] 10Operations, 10Traffic: gdnsd plugin support for ACME DNS challenges - https://phabricator.wikimedia.org/T194965#4214212 (10Paladox) [17:27:52] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix debianization for python3 [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/433739 (owner: 10Giuseppe Lavagetto) [17:27:54] th paladox <3 [17:27:57] *thx [17:28:02] your welcome :) [17:28:32] 10Operations, 10ops-eqiad: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4214718 (10JJMC89) [17:29:42] 10Operations, 10Discovery, 10Maps: disk usage increase on maps servers - https://phabricator.wikimedia.org/T194966#4214720 (10JJMC89) [17:31:31] PROBLEM - Check systemd state on einsteinium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:41:10] RECOVERY - Device not healthy -SMART- on db1066 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1066&var-datasource=eqiad%2520prometheus%252Fops [17:41:25] checks einsteinium [17:41:55] (03PS1) 10Faidon Liambotis: "Update" netbox to 2.2.4 + WMF patches [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/433757 [17:42:17] (03CR) 10Faidon Liambotis: [C: 032] "Update" netbox to 2.2.4 + WMF patches [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/433757 (owner: 10Faidon Liambotis) [17:42:34] (03CR) 10Faidon Liambotis: [V: 032 C: 032] "Update" netbox to 2.2.4 + WMF patches [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/433757 (owner: 10Faidon Liambotis) [17:45:05] !log faidon@tin Started deploy [netbox/deploy@90164b3]: Update netbox to 2.2.4 + WMF patches [17:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:12] ehm.. i added a new DNS record yesterday, using host or dig i can see it, but ferm says the lookup fails.. [17:45:37] !log faidon@tin Finished deploy [netbox/deploy@90164b3]: Update netbox to 2.2.4 + WMF patches (duration: 00m 32s) [17:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:37] !log faidon@tin Started deploy [netbox/deploy@90164b3]: Update netbox to 2.2.4 + WMF patches [17:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:42] !log faidon@tin Finished deploy [netbox/deploy@90164b3]: Update netbox to 2.2.4 + WMF patches (duration: 00m 05s) [17:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:46] !log faidon@tin Started deploy [netbox/deploy@90164b3]: Update netbox to 2.2.4 + WMF patches [17:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:51] !log faidon@tin Finished deploy [netbox/deploy@90164b3]: Update netbox to 2.2.4 + WMF patches (duration: 00m 05s) [17:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:30] 
PROBLEM - Check systemd state on neon is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:56:50] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:57:25] (03PS1) 10Dzahn: add IPv6 records for deploy2001 [dns] - 10https://gerrit.wikimedia.org/r/433759 (https://phabricator.wikimedia.org/T193916) [17:57:37] (03CR) 10jerkins-bot: [V: 04-1] add IPv6 records for deploy2001 [dns] - 10https://gerrit.wikimedia.org/r/433759 (https://phabricator.wikimedia.org/T193916) (owner: 10Dzahn) [17:58:05] 10Operations, 10ops-eqiad, 10Cloud-Services: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4215028 (10Bstorm) [17:59:03] (03PS2) 10Dzahn: add IPv6 records for deploy2001 [dns] - 10https://gerrit.wikimedia.org/r/433759 (https://phabricator.wikimedia.org/T193916) [17:59:24] this will fix the puppet errors on neon and einsteinium.. just a sec [18:00:49] (03CR) 10Dzahn: [C: 032] add IPv6 records for deploy2001 [dns] - 10https://gerrit.wikimedia.org/r/433759 (https://phabricator.wikimedia.org/T193916) (owner: 10Dzahn) [18:02:29] 10Operations, 10ops-eqiad, 10Cloud-Services: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4214198 (10chasemp) I tried to bring the eth1 interfaces up and no dice. My thought is they are not connected. 
[18:03:43] !log stop replication and start schema change on db1105 [18:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:40] 10Operations, 10ops-eqiad, 10Cloud-Services: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4215139 (10Bstorm) a:03Cmjohnson [18:06:53] 10Operations, 10ops-eqiad, 10Cloud-Services: Connect or troubleshoot eth1 on labvirt1019 and labvirt1020 - https://phabricator.wikimedia.org/T194964#4215194 (10chasemp) eth1 on both should be connected and configured to be in the cloud-instance-ports interface-range which makes them trunks that pass the inst... [18:09:01] RECOVERY - Check systemd state on einsteinium is OK: OK - running: The system is fully operational [18:09:28] !log einsteinium: started ferm service [18:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:31] RECOVERY - Check systemd state on neon is OK: OK - running: The system is fully operational [18:11:39] !log neon, netmon1002 - start ferm service [18:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:46] 10Operations, 10Discovery, 10Maps: Track more detailed disk usage - https://phabricator.wikimedia.org/T194997#4215213 (10Pnorman) [18:14:26] 10Operations, 10Discovery, 10Maps: disk usage increase on maps servers - https://phabricator.wikimedia.org/T194966#4215229 (10Pnorman) Points from IRC conversation - v3 keyspace has already been removed from both - cassandra compaction is manually running and recovering space - pg_xlog takes 30GB - we have... 
[18:18:01] 10Operations, 10Cloud-VPS: Create custom deployment-prep role that allows editing of Designate records only - https://phabricator.wikimedia.org/T194998#4215239 (10Krenair) p:05Triage>03Normal [18:37:21] (03PS1) 10Pnorman: Move process-osm-data example URLs to https [puppet] - 10https://gerrit.wikimedia.org/r/433776 (https://phabricator.wikimedia.org/T190193) [18:39:31] paravoid: on netmon servers, the uwsgi service fails to start. could that be related to netbox deploy? there also was a temp problem with the ferm service that was caused by me but i fixed it already [18:41:01] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:43:15] arturo: ^ [18:43:26] arturo: oh sorry that's not the server I was thinking :D [18:43:32] (03CR) 10Adam Johanes: [C: 04-1] "NotASpy paladox JustBerry Krenair Sagan samwilson Reedy: merge with master!" [puppet] - 10https://gerrit.wikimedia.org/r/433776 (https://phabricator.wikimedia.org/T190193) (owner: 10Pnorman) [18:46:27] grr [18:46:29] i got pinged 33 times by that user [18:47:17] paladox: do you know if gerrit has any rate limiting [18:47:30] bawolff i can have a look, but don't think so [18:47:39] all accounts are loaded from ldap into the db [18:47:50] (when the user logs in it creates a db entry) [18:52:20] RECOVERY - MegaRAID on db1066 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [18:55:32] (03PS1) 10Bawolff: Revert "network/tcpircbot/kubernetes: add deploy2001 to allowed hosts" [puppet] - 10https://gerrit.wikimedia.org/r/433806 [18:55:34] (03PS1) 10Bawolff: Revert "Fix debianization for python3" [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/433807 [18:55:51] (03Abandoned) 10Bawolff: Revert "network/tcpircbot/kubernetes: add deploy2001 to allowed hosts" [puppet] - 10https://gerrit.wikimedia.org/r/433806 (owner: 10Bawolff) [18:55:53] (03Abandoned) 10Bawolff: Revert "Fix debianization for python3"
[debs/python-etcd] - 10https://gerrit.wikimedia.org/r/433807 (owner: 10Bawolff) [18:56:44] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T194955#4215416 (10Marostegui) This is all good now ``` root@db1066:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name : RAID Level : Prim... [18:57:08] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T194955#4215417 (10Marostegui) 05Open>03Resolved [18:57:51] (03PS1) 10Bawolff: Revert "add IPv6 records for deploy2001" [dns] - 10https://gerrit.wikimedia.org/r/433814 [18:57:58] (03PS1) 10Bawolff: Revert ""Update" netbox to 2.2.4 + WMF patches" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/433817 [18:58:31] (03Abandoned) 10Bawolff: Revert ""Update" netbox to 2.2.4 + WMF patches" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/433817 (owner: 10Bawolff) [18:58:37] (03Abandoned) 10Bawolff: Revert "add IPv6 records for deploy2001" [dns] - 10https://gerrit.wikimedia.org/r/433814 (owner: 10Bawolff) [18:59:09] (03PS1) 10Bawolff: Revert ""Update" netbox to 2.2.4 + WMF patches" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/433818 [19:01:11] !log unlock bawolff gerrit account [19:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:52] Reedy, SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -l mwdeploy@somehost [19:17:52] from https://wikitech.wikimedia.org/wiki/Keyholder [19:19:20] bd808, ^ to update [19:20:14] another spammer https://gerrit.wikimedia.org/r/#/q/owner:%22Thiemo+Kreuz+(WNDE)+%253Ctemp%2540gmail.com%253E%22 [19:21:57] blocked [19:22:15] thanks [19:22:25] ok he's using wmf users names now. [19:25:00] WNDE - Wikinews Germany! 
[19:25:38] 10Operations, 10Wikimedia-Mailing-lists: Provide a mean to mass discard/reject subscription requests on Wikimedia mailing lists - https://phabricator.wikimedia.org/T194669#4215488 (10MarcoAurelio) [19:27:51] bawolff: wtf https://gerrit.wikimedia.org/r/#/c/433802/ ? [19:27:57] I guess that's not you right? [19:28:11] email doesn't match [19:28:24] Hauskatze: no. it's a spammer [19:28:30] real bawolff on Gerrit is "Brian Wolff" [19:28:42] not bawolff or whatever. [19:28:45] a vandal I'd say [19:28:55] ok thanks, I feel better now, phew :) [19:29:12] yes, folks are dealing with it [19:30:13] he's been busy eh https://gerrit.wikimedia.org/r/#/q/owner:boduyuz%2540gifto12.com [19:36:13] !log tstarling manually loaded Tor Exit Nodes on wikitech [19:36:13] (03PS1) 10Reedy: Load Tor Exit Nodes on labswiki [puppet] - 10https://gerrit.wikimedia.org/r/433826 [19:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:32] (03CR) 10Alexandros Kosiaris: [C: 032] ircecho: replace base::service_unit with systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/433686 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [19:36:38] (03PS2) 10Alexandros Kosiaris: ircecho: replace base::service_unit with systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/433686 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [19:36:41] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] ircecho: replace base::service_unit with systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/433686 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [19:36:43] (03CR) 10jerkins-bot: [V: 04-1] Load Tor Exit Nodes on labswiki [puppet] - 10https://gerrit.wikimedia.org/r/433826 (owner: 10Reedy) [19:37:56] (03PS2) 10Reedy: Load Tor Exit Nodes on labswiki [puppet] - 10https://gerrit.wikimedia.org/r/433826 [19:38:23] (03CR) 10jerkins-bot: [V: 04-1] Load Tor Exit Nodes on labswiki [puppet] - 10https://gerrit.wikimedia.org/r/433826 
(owner: 10Reedy) [19:39:00] (03PS3) 10Reedy: Load Tor Exit Nodes on labswiki [puppet] - 10https://gerrit.wikimedia.org/r/433826 [19:39:26] (03CR) 10jerkins-bot: [V: 04-1] Load Tor Exit Nodes on labswiki [puppet] - 10https://gerrit.wikimedia.org/r/433826 (owner: 10Reedy) [19:40:38] (03PS4) 10Reedy: Load Tor Exit Nodes on labswiki [puppet] - 10https://gerrit.wikimedia.org/r/433826 [19:41:13] (03CR) 10jerkins-bot: [V: 04-1] Load Tor Exit Nodes on labswiki [puppet] - 10https://gerrit.wikimedia.org/r/433826 (owner: 10Reedy) [19:46:42] (03PS5) 10Reedy: Load Tor Exit Nodes on labswiki [puppet] - 10https://gerrit.wikimedia.org/r/433826 [19:47:26] (03CR) 10jerkins-bot: [V: 04-1] Load Tor Exit Nodes on labswiki [puppet] - 10https://gerrit.wikimedia.org/r/433826 (owner: 10Reedy) [19:48:11] (03PS6) 10Reedy: Load Tor Exit Nodes on labswiki [puppet] - 10https://gerrit.wikimedia.org/r/433826 [19:48:15] (03PS7) 10Reedy: Load Tor Exit Nodes on labswiki [puppet] - 10https://gerrit.wikimedia.org/r/433826 [19:48:57] (03CR) 10jerkins-bot: [V: 04-1] Load Tor Exit Nodes on labswiki [puppet] - 10https://gerrit.wikimedia.org/r/433826 (owner: 10Reedy) [19:50:20] (03PS8) 10Reedy: Load Tor Exit Nodes on labswiki [puppet] - 10https://gerrit.wikimedia.org/r/433826 [19:51:41] (03CR) 10BryanDavis: [C: 031] Load Tor Exit Nodes on labswiki [puppet] - 10https://gerrit.wikimedia.org/r/433826 (owner: 10Reedy) [19:52:02] (03CR) 10Tim Starling: [C: 032] Load Tor Exit Nodes on labswiki [puppet] - 10https://gerrit.wikimedia.org/r/433826 (owner: 10Reedy) [19:56:19] (03PS1) 10Reedy: ensure => true to ensure => 'present' in web.pp [puppet] - 10https://gerrit.wikimedia.org/r/433829 [19:57:37] (03CR) 10Tim Starling: [C: 032] ensure => true to ensure => 'present' in web.pp [puppet] - 10https://gerrit.wikimedia.org/r/433829 (owner: 10Reedy) [19:58:27] (03CR) 10BryanDavis: ensure => true to ensure => 'present' in web.pp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/433829 (owner: 10Reedy) [20:01:49] 
(03PS2) 10Smalyshev: Add to the list all wikis except for private ones. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433740 (https://phabricator.wikimedia.org/T194260) [20:03:26] (03PS1) 10Urbanecm: Initial configuration for pmswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433830 (https://phabricator.wikimedia.org/T194879) [20:04:58] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for pmswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433830 (https://phabricator.wikimedia.org/T194879) (owner: 10Urbanecm) [20:06:11] 10Operations, 10DNS, 10Mail, 10Patch-For-Review: Outbound mail from Greenhouse is broken - https://phabricator.wikimedia.org/T189065#4215606 (10bbogaert) Great, thanks! [20:07:41] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 783 bytes in 0.081 second response time [20:08:30] RECOVERY - keystone public endoint port 5000 on labtestcontrol2001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 781 bytes in 0.081 second response time [20:09:08] (03PS2) 10Urbanecm: Initial configuration for pmswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433830 (https://phabricator.wikimedia.org/T194879) [20:09:20] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2003 is CRITICAL: connect to address 208.80.153.75 and port 35357: Connection refused [20:09:41] PROBLEM - keystone public endoint port 5000 on labtestcontrol2003 is CRITICAL: connect to address 208.80.153.75 and port 5000: Connection refused [20:10:45] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for pmswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433830 (https://phabricator.wikimedia.org/T194879) (owner: 10Urbanecm) [20:19:56] (03PS3) 10Urbanecm: Initial configuration for pmswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433830 (https://phabricator.wikimedia.org/T194879) [20:21:01] RECOVERY - keystone public endoint 
port 5000 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 757 bytes in 0.081 second response time [20:21:34] (03CR) 10jerkins-bot: [V: 04-1] Initial configuration for pmswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433830 (https://phabricator.wikimedia.org/T194879) (owner: 10Urbanecm) [20:22:00] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2003 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 759 bytes in 0.078 second response time [20:24:01] (03PS4) 10Urbanecm: Initial configuration for pmswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/433830 (https://phabricator.wikimedia.org/T194879) [20:53:36] (03PS1) 10Chad: Quota plugin @ stable-2.14 [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/433889 [21:10:22] (03CR) 10Chad: [V: 032 C: 032] Quota plugin @ stable-2.14 [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/433889 (owner: 10Chad) [21:15:44] (03PS1) 10Chad: Adding quota plugin @ stable-2.14 [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/433907 [21:16:16] (03CR) 10Chad: [V: 032 C: 032] Adding quota plugin @ stable-2.14 [software/gerrit] (stable-2.14) - 10https://gerrit.wikimedia.org/r/433907 (owner: 10Chad) [21:16:50] !log demon@tin Started deploy [gerrit/gerrit@a07d943]: quota plugin [21:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:02] !log demon@tin Finished deploy [gerrit/gerrit@a07d943]: quota plugin (duration: 00m 11s) [21:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:22] 10Operations, 10SRE-Access-Requests, 10Discovery-Search (Current work): Google Search Console access for Search Platform team - https://phabricator.wikimedia.org/T188453#4215826 (10mpopov) @EBjune @RobH: BTW @JKatzWMF and I are going to be changing which properties are tracked in GSC (namely getting rid of a... 
[21:45:31] !log updated some OAuth consumer for a hackathon project: update oauth_registered_consumer set oarc_callback_url = 'http://localhost' where oarc_consumer_key = '2828bd9ca9bdcd81a960721819f25e90'; [21:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:25] err, commons is down for me [22:56:58] works for me [22:57:04] MaxSem ^^ [22:58:06] PROBLEM - HTTP availability for Varnish at eqsin on einsteinium is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [22:58:07] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:58:29] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:58:36] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:58:36] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:58:37] PROBLEM - HTTP availability for Varnish at esams on einsteinium is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [22:58:37] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [22:58:44] looks like not everything 
;) [22:58:47] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:58:58] bblack: ema ^ [22:59:16] PROBLEM - HTTP availability for Varnish at eqiad on einsteinium is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [22:59:36] PROBLEM - HTTP availability for Varnish at codfw on einsteinium is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [23:00:06] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [23:00:06] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [23:00:26] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [23:00:33] oh esams is what i connect through but commons still works [23:00:36] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [23:00:47] 
RECOVERY - HTTP availability for Varnish at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [23:00:56] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [23:01:06] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [23:01:16] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [23:01:46] RECOVERY - HTTP availability for Varnish at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [23:01:56] RECOVERY - HTTP availability for Varnish at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [23:01:57] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [23:02:26] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [23:02:26] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [23:02:26] RECOVERY - HTTP availability for Varnish at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [23:02:26] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [23:02:36] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [23:07:36] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [23:07:37] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [23:07:57] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [23:08:06] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [23:08:39] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [23:08:46] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [23:11:20] (PS1) Aaron Schulz: [WIP] Enable mcrouter on mediawiki memcached nodes [puppet] - https://gerrit.wikimedia.org/r/433913 (https://phabricator.wikimedia.org/T194225) [23:12:06] (CR) jerkins-bot: [V: -1] [WIP] Enable mcrouter on mediawiki memcached nodes [puppet] - https://gerrit.wikimedia.org/r/433913 (https://phabricator.wikimedia.org/T194225) (owner: Aaron Schulz)