[00:01:24] (CR) Ayounsi: [V: +2 C: +2] Add security alg/forwarding-options/screen to mr template [homer/public] - https://gerrit.wikimedia.org/r/550356 (owner: Ayounsi)
[00:02:34] (PS2) Ayounsi: Add security alg/forwarding-options/screen to mr template [homer/public] - https://gerrit.wikimedia.org/r/550356
[00:03:26] (CR) Ayounsi: [V: +2 C: +2] Add security alg/forwarding-options/screen to mr template [homer/public] - https://gerrit.wikimedia.org/r/550356 (owner: Ayounsi)
[00:17:07] (PS3) Faidon Liambotis: Automatically cast network strings to ipaddress objects [software/homer] - https://gerrit.wikimedia.org/r/551273 (owner: Ayounsi)
[00:19:23] (CR) jerkins-bot: [V: -1] Automatically cast network strings to ipaddress objects [software/homer] - https://gerrit.wikimedia.org/r/551273 (owner: Ayounsi)
[00:20:32] (CR) Faidon Liambotis: Automatically cast network strings to ipaddress objects (2 comments) [software/homer] - https://gerrit.wikimedia.org/r/551273 (owner: Ayounsi)
[00:35:50] (PS1) Andrew Bogott: Typo fix, followup to d119b955d908fecad021c8427b5078ec1295112e [puppet] - https://gerrit.wikimedia.org/r/551298 (https://phabricator.wikimedia.org/T210715)
[00:39:03] (CR) Andrew Bogott: [C: +2] Typo fix, followup to d119b955d908fecad021c8427b5078ec1295112e [puppet] - https://gerrit.wikimedia.org/r/551298 (https://phabricator.wikimedia.org/T210715) (owner: Andrew Bogott)
[01:36:25] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:36:55] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:16:42] (CR) Ayounsi: Automatically cast network strings to ipaddress
objects (2 comments) [software/homer] - https://gerrit.wikimedia.org/r/551273 (owner: Ayounsi)
[04:57:08] (CR) AndreG-P: [C: +1] Enable links from math formulae on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/551180 (https://phabricator.wikimedia.org/T208758) (owner: Physikerwelt)
[07:49:53] PROBLEM - Disk space on Hadoop worker on an-worker1095 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/g 20 GB (0% inode=99%): /var/lib/hadoop/data/d 16 GB (0% inode=99%): /var/lib/hadoop/data/e 17 GB (0% inode=99%): /var/lib/hadoop/data/f 23 GB (0% inode=99%): /var/lib/hadoop/data/c 19 GB (0% inode=99%): /var/lib/hadoop/data/l 23 GB (0% inode=99%): /var/lib/hadoop/data/b 25 GB (0% inode=99%): /var/lib/hadoop/data/k
[07:49:53] 99%): /var/lib/hadoop/data/i 25 GB (0% inode=99%): /var/lib/hadoop/data/h 21 GB (0% inode=99%): /var/lib/hadoop/data/m 22 GB (0% inode=99%): /var/lib/hadoop/data/j 17 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[08:15:37] RECOVERY - MariaDB Slave Lag: s6 on db2089 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[08:25:27] PROBLEM - Disk space on Hadoop worker on an-worker1094 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 25 GB (0% inode=99%): /var/lib/hadoop/data/b 24 GB (0% inode=99%): /var/lib/hadoop/data/f 26 GB (0% inode=99%): /var/lib/hadoop/data/k 27 GB (0% inode=99%): /var/lib/hadoop/data/g 26 GB (0% inode=99%): /var/lib/hadoop/data/m 19 GB (0% inode=99%): /var/lib/hadoop/data/c 27 GB (0% inode=99%): /var/lib/hadoop/data/d
[08:25:27] 99%): /var/lib/hadoop/data/j 16 GB (0% inode=99%): /var/lib/hadoop/data/h 26 GB (0% inode=99%): /var/lib/hadoop/data/l 26 GB (0% inode=99%): /var/lib/hadoop/data/i 26 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[08:44:15] PROBLEM - Disk space on Hadoop worker
on an-worker1094 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 24 GB (0% inode=99%): /var/lib/hadoop/data/b 23 GB (0% inode=99%): /var/lib/hadoop/data/f 27 GB (0% inode=99%): /var/lib/hadoop/data/k 27 GB (0% inode=99%): /var/lib/hadoop/data/g 25 GB (0% inode=99%): /var/lib/hadoop/data/m 18 GB (0% inode=99%): /var/lib/hadoop/data/c 26 GB (0% inode=99%): /var/lib/hadoop/data/d
[08:44:15] 99%): /var/lib/hadoop/data/j 16 GB (0% inode=99%): /var/lib/hadoop/data/h 26 GB (0% inode=99%): /var/lib/hadoop/data/l 24 GB (0% inode=99%): /var/lib/hadoop/data/i 24 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[08:52:49] PROBLEM - Disk space on Hadoop worker on an-worker1094 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 23 GB (0% inode=99%): /var/lib/hadoop/data/b 22 GB (0% inode=99%): /var/lib/hadoop/data/f 26 GB (0% inode=99%): /var/lib/hadoop/data/k 26 GB (0% inode=99%): /var/lib/hadoop/data/g 30 GB (0% inode=99%): /var/lib/hadoop/data/m 21 GB (0% inode=99%): /var/lib/hadoop/data/c 26 GB (0% inode=99%): /var/lib/hadoop/data/d
[08:52:49] 99%): /var/lib/hadoop/data/j 16 GB (0% inode=99%): /var/lib/hadoop/data/h 26 GB (0% inode=99%): /var/lib/hadoop/data/l 23 GB (0% inode=99%): /var/lib/hadoop/data/i 23 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[09:03:03] PROBLEM - Disk space on Hadoop worker on an-worker1094 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 20 GB (0% inode=99%): /var/lib/hadoop/data/b 21 GB (0% inode=99%): /var/lib/hadoop/data/f 23 GB (0% inode=99%): /var/lib/hadoop/data/k 25 GB (0% inode=99%): /var/lib/hadoop/data/g 24 GB (0% inode=99%): /var/lib/hadoop/data/m 21 GB (0% inode=99%): /var/lib/hadoop/data/c 24 GB (0% inode=99%): /var/lib/hadoop/data/d
[09:03:03] 99%): /var/lib/hadoop/data/j 16 GB (0% inode=99%): /var/lib/hadoop/data/h 26 GB (0% inode=99%): /var/lib/hadoop/data/l 21 GB (0% inode=99%):
/var/lib/hadoop/data/i 20 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[09:11:07] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 59.76 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:21:25] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 70.56 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:43:56] !log systemctl restart hadoop-* on analytics1077 after oom killer
[09:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:29] RECOVERY - Hadoop NodeManager on analytics1077 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[09:44:29] RECOVERY - Hadoop DataNode on analytics1077 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[09:50:53] RECOVERY - Disk space on Hadoop worker on an-worker1094 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[09:51:17] RECOVERY - Disk space on Hadoop worker on an-worker1095 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[09:52:11] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 59.8 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[09:55:39] PROBLEM - Varnish
traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 52.13 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[10:09:19] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 78.55 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[10:40:05] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 55.72 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[10:46:55] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 49.98 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[10:58:51] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 109.2 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[11:34:57] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 54.33 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[11:40:09] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 71.92 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[11:47:01] PROBLEM - Varnish
traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 56.16 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[11:55:33] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 70.22 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[12:17:13] !log restart rsyslog on mw2221
[12:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:33] (PS3) Jcrespo: bacula: Add prometheus exporter [puppet] - https://gerrit.wikimedia.org/r/551145 (https://phabricator.wikimedia.org/T234900)
[12:43:36] (CR) jerkins-bot: [V: -1] bacula: Add prometheus exporter [puppet] - https://gerrit.wikimedia.org/r/551145 (https://phabricator.wikimedia.org/T234900) (owner: Jcrespo)
[13:11:58] (PS1) Elukey: profile::analytics::cluster::client: add sudo to nagios command [puppet] - https://gerrit.wikimedia.org/r/551308
[13:15:13] (CR) Elukey: [C: +2] profile::analytics::cluster::client: add sudo to nagios command [puppet] - https://gerrit.wikimedia.org/r/551308 (owner: Elukey)
[15:10:33] (CR) Physikerwelt: [C: +1] "Yes. Let's go." [mediawiki-config] - https://gerrit.wikimedia.org/r/551180 (https://phabricator.wikimedia.org/T208758) (owner: Physikerwelt)
[16:22:05] RECOVERY - Memory correctable errors -EDAC- on mw1239 is OK: (C)4 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1239&var-datasource=eqiad+prometheus/ops
[16:30:15] (CR) MarcoAurelio: "FIXME: Caused T238480 as it also removed the views on `centralauth`."
[puppet] - https://gerrit.wikimedia.org/r/550888 (https://phabricator.wikimedia.org/T237509) (owner: Andrew Bogott)
[16:40:38] Operations, Wikimedia-Site-requests, Chinese-Sites, Community-consensus-needed: Enable "upload by url" feature at zhwiki - https://phabricator.wikimedia.org/T142991 (Masumrezarock100) Why community consensus is still needed? Wasn't there a local discussion?
[18:07:40] (PS1) ArielGlenn: make dumpsdata1002 spare before reimaging [puppet] - https://gerrit.wikimedia.org/r/551317 (https://phabricator.wikimedia.org/T224563)
[18:16:41] (CR) ArielGlenn: [C: +2] make dumpsdata1002 spare before reimaging [puppet] - https://gerrit.wikimedia.org/r/551317 (https://phabricator.wikimedia.org/T224563) (owner: ArielGlenn)
[18:24:22] (PS1) ArielGlenn: make dupsdata1002 install buster instead of jessie [puppet] - https://gerrit.wikimedia.org/r/551319 (https://phabricator.wikimedia.org/T224563)
[18:24:52] (PS2) ArielGlenn: make dumpsdata1002 install buster instead of jessie [puppet] - https://gerrit.wikimedia.org/r/551319 (https://phabricator.wikimedia.org/T224563)
[18:27:53] (CR) ArielGlenn: [C: +2] make dumpsdata1002 install buster instead of jessie [puppet] - https://gerrit.wikimedia.org/r/551319 (https://phabricator.wikimedia.org/T224563) (owner: ArielGlenn)
[18:57:27] Operations, DBA, SRE-Access-Requests, Patch-For-Review: Read access for phabricator-admins (aklapper) to Phabricator production database to run SELECT queries - https://phabricator.wikimedia.org/T238425 (Aklapper) Dzahn rightfully pointed out that Phabricator uses quite some DBs. If I had to limi...
[18:58:14] (PS1) Andrew Bogott: pdns/cloudservices: add all_from list to dns api [puppet] - https://gerrit.wikimedia.org/r/551320 (https://phabricator.wikimedia.org/T210715)
[19:01:15] (PS2) Andrew Bogott: pdns/cloudservices: add all_from list to dns api [puppet] - https://gerrit.wikimedia.org/r/551320 (https://phabricator.wikimedia.org/T210715)
[19:04:38] (CR) Andrew Bogott: [C: +2] pdns/cloudservices: add all_from list to dns api [puppet] - https://gerrit.wikimedia.org/r/551320 (https://phabricator.wikimedia.org/T210715) (owner: Andrew Bogott)
[19:28:11] Operations, Dumps-Generation: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ariel on cumin1001.eqiad.wmnet for hosts: ` ['dumpsdata1002.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/2...
[19:42:24] PROBLEM - Check the Netbox report puppetdb for fail status. on netbox1001 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[19:51:09] Operations, MobileFrontend, TechCom-RFC, Traffic, Readers-Web-Backlog (Tracking): Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Darxus) Please do this. I come from an example of these .m. urls being obnoxious in the...
[20:25:17] !log ariel@cumin1001 START - Cookbook sre.hosts.downtime
[20:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:27:23] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[20:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:08] Operations, Dumps-Generation: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dumpsdata1002.eqiad.wmnet'] ` and were **ALL** successful.
[20:40:44] RECOVERY - Check the Netbox report puppetdb for fail status. on netbox1001 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports
[21:02:35] Operations, Dumps-Generation: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (ArielGlenn) Expanded /data on dumpsdata1002, rsyncing copies of adds-changes dumps now from dumpsdata1003 in a screen session. After that I'll pick up the categoryrdf dumps, also via rsync...
[21:06:05] Operations, Dumps-Generation: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (ArielGlenn)
[21:21:57] Operations, Phabricator: List of recent most active Phab "Priority" field setters - https://phabricator.wikimedia.org/T235153 (Aklapper) Thanks DZahn! <3