[00:00:48] !log switching apt.wikimedia.org from carbon to install1002 - there might be a short time until the LE SSL cert is also adjusted [00:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:01:48] Operations, Patch-For-Review: Split carbon's install/mirror roles, provision install1001 - https://phabricator.wikimedia.org/T132757#3018660 (Dzahn) 16:02 < mutante> !log switching apt.wikimedia.org from carbon to install1002 - there might be a short time until the LE SSL cert is also adjusted [00:06:08] RECOVERY - puppet last run on labvirt1006 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [00:07:18] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [00:09:03] (PS1) Dzahn: Revert "switch apt.wm.org from carbon to install1002" [dns] - https://gerrit.wikimedia.org/r/337195 [00:09:32] (CR) Dzahn: [C: 2] "doing this on Monday instead. i had not run authdns-update yet" [dns] - https://gerrit.wikimedia.org/r/337195 (owner: Dzahn) [00:12:23] (PS1) Dzahn: switch apt.wm.org from carbon to install1002 [dns] - https://gerrit.wikimedia.org/r/337196 [00:13:08] (PS2) Dzahn: switch apt.wm.org from carbon to install1002 [dns] - https://gerrit.wikimedia.org/r/337196 (https://phabricator.wikimedia.org/T132757) [00:13:38] Operations, MediaWiki-Vagrant, Release-Engineering-Team, Epic: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#3018671 (bd808) [00:13:48] Operations, MediaWiki-Vagrant, Release-Engineering-Team, Epic: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#2334744 (bd808) [00:15:05] (PS1) Dzahn: remove carbon from puppet [puppet] - https://gerrit.wikimedia.org/r/337197 [00:15:18] PROBLEM - puppet last run on fluorine is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:18:19] (PS1) Dzahn: let install1002 be the new source for APT data rsync [puppet] - https://gerrit.wikimedia.org/r/337198 [00:22:20] (PS2) Dzahn: install: remove carbon from puppet and netboot [puppet] - https://gerrit.wikimedia.org/r/337197 (https://phabricator.wikimedia.org/T132757) [00:23:11] Operations, MediaWiki-Vagrant, Release-Engineering-Team, Epic: [EPIC] Migrate base image to Debian Jessie - https://phabricator.wikimedia.org/T136429#3018679 (bd808) [00:23:28] RECOVERY - configured eth on d-i-test is OK: OK - interfaces up [00:23:28] RECOVERY - DPKG on d-i-test is OK: All packages OK [00:23:38] RECOVERY - dhclient process on d-i-test is OK: PROCS OK: 0 processes with command name dhclient [00:23:48] RECOVERY - Disk space on d-i-test is OK: DISK OK [00:23:48] RECOVERY - puppet last run on d-i-test is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:23:58] PROBLEM - puppet last run on mw1297 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:24:57] Operations, Gerrit, Patch-For-Review: Investigate seemingly random Gerrit slow-downs - https://phabricator.wikimedia.org/T148478#3018680 (demon) I've got a pretty strong suspicion that this will fix the overall issue. 
[00:26:15] (PS1) Dzahn: delete install1001/2001 from Hiera data [puppet] - https://gerrit.wikimedia.org/r/337199 (https://phabricator.wikimedia.org/T157840) [00:40:16] (CR) Faidon Liambotis: [C: -1] Don't enable the Diamond ntpd collector if systemd-timesyncd is used (1 comment) [puppet] - https://gerrit.wikimedia.org/r/337009 (https://phabricator.wikimedia.org/T157794) (owner: Muehlenhoff) [00:44:18] RECOVERY - puppet last run on fluorine is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [00:49:37] (CR) Dzahn: [C: 2] delete install1001/2001 from Hiera data [puppet] - https://gerrit.wikimedia.org/r/337199 (https://phabricator.wikimedia.org/T157840) (owner: Dzahn) [00:50:59] (PS2) Dzahn: delete install1001/2001 from Hiera data [puppet] - https://gerrit.wikimedia.org/r/337199 (https://phabricator.wikimedia.org/T157840) [00:51:27] (CR) Dzahn: [V: 2 C: 2] delete install1001/2001 from Hiera data [puppet] - https://gerrit.wikimedia.org/r/337199 (https://phabricator.wikimedia.org/T157840) (owner: Dzahn) [00:52:58] RECOVERY - puppet last run on mw1297 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [01:01:10] (Abandoned) Dzahn: install: copy/move apt.wm.org setup to aptrepo module [puppet] - https://gerrit.wikimedia.org/r/325864 (https://phabricator.wikimedia.org/T132757) (owner: Dzahn) [01:06:38] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [01:07:37] (PS1) Dzahn: lint: 'include base::firewall' -> 'include ::base::firewall' [puppet] - https://gerrit.wikimedia.org/r/337201 [01:08:08] PROBLEM - puppet last run on db1023 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:08:49] (CR) jerkins-bot: [V: -1] lint: 'include base::firewall' -> 'include ::base::firewall' [puppet] - https://gerrit.wikimedia.org/r/337201 (owner: Dzahn) [01:10:18] (PS1) Dzahn: lint: 'include standard' -> 'include ::standard' [puppet] - https://gerrit.wikimedia.org/r/337202 [01:11:17] (PS2) Dzahn: lint: 'include base::firewall' -> 'include ::base::firewall' [puppet] - https://gerrit.wikimedia.org/r/337201 [01:13:29] (PS3) Faidon Liambotis: salt: use SHA256 master key fingerprint on newer systems [puppet] - https://gerrit.wikimedia.org/r/337189 [01:16:17] (CR) Faidon Liambotis: [C: 2] salt: use SHA256 master key fingerprint on newer systems [puppet] - https://gerrit.wikimedia.org/r/337189 (owner: Faidon Liambotis) [01:16:34] (PS1) Dzahn: contint: drop npm settings for precise [puppet] - https://gerrit.wikimedia.org/r/337203 [01:17:48] (CR) jerkins-bot: [V: -1] contint: drop npm settings for precise [puppet] - https://gerrit.wikimedia.org/r/337203 (owner: Dzahn) [01:18:08] RECOVERY - salt-minion processes on d-i-test is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [01:18:18] RECOVERY - Check systemd state on d-i-test is OK: OK - running: The system is fully operational [01:18:25] (PS1) Dzahn: mariadb/prometheus: remove workaround for precise [puppet] - https://gerrit.wikimedia.org/r/337204 [01:21:22] Operations: systemd-timedated starting up every minute - https://phabricator.wikimedia.org/T157797#3016701 (faidon) timedated is a socket-activated daemon. systemd spawns it every time its socket gets connected to, and timedatectl is doing that. We call timedatectl from an Icinga/NRPE check (check_timedatect... 
[01:22:05] (PS2) Dzahn: contint: drop npm settings for precise [puppet] - https://gerrit.wikimedia.org/r/337203 [01:24:09] Operations, Patch-For-Review: Evaluate use of systemd-timesyncd on jessie for clock synchronisation - https://phabricator.wikimedia.org/T150257#2779478 (faidon) It looks like timesyncd is enabled out of the box on new stretch installs. Our test system, which doesn't have the hiera flag set, thus exhibits... [01:24:18] (PS1) Dzahn: labs_vagrant: drop precise support [puppet] - https://gerrit.wikimedia.org/r/337205 [01:25:01] Operations, Monitoring, Traffic, Patch-For-Review: diamond crashing on hosts using systemd-timesyncd - https://phabricator.wikimedia.org/T157794#3016635 (faidon) FWIW, stretch's version (4.0.515-3) doesn't crash, but complains about being unable to connect to the NTP server every few minutes. [01:29:20] (PS1) Dzahn: toollabs: drop precise-related monitoring check [puppet] - https://gerrit.wikimedia.org/r/337207 [01:35:38] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [01:35:45] Operations: Replace nrpe 2.15 (& evaluate alternatives) - https://phabricator.wikimedia.org/T157853#3018830 (faidon) [01:37:08] RECOVERY - puppet last run on db1023 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [01:37:40] Operations, Wikimedia-Logstash: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#3018849 (Tgr) [02:01:11] (PS1) Faidon Liambotis: Remove jzerebecki from Icinga contact groups [puppet] - https://gerrit.wikimedia.org/r/337209 [02:21:27] (CR) Gergő Tisza: "Yes. See inline comments. 
'+' is handled at https://github.com/wikimedia/mediawiki/blob/d67197fa116acc366419faedeeacd91158a98f8b/includes/" (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/336747 (https://phabricator.wikimedia.org/T157656) (owner: Gergő Tisza) [02:22:08] PROBLEM - Disk space on elastic2001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=98%) [02:30:28] PROBLEM - Disk space on elastic2025 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=98%) [02:31:51] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.11) (duration: 11m 31s) [02:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:37:10] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Feb 11 02:37:10 UTC 2017 (duration 5m 19s) [02:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:39:38] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:04:58] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 1809.673338 Seconds [03:05:58] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 24.36108 Seconds [03:07:38] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [03:30:58] PROBLEM - puppet last run on restbase1008 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:45:38] PROBLEM - Host cp1052 is DOWN: PING CRITICAL - Packet loss = 100% [03:51:18] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 54 connecting: cp1052_v4, cp1052_v6 [03:51:18] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 54 connecting: cp1052_v4, cp1052_v6 [03:51:28] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 54 connecting: cp1052_v4, cp1052_v6 [03:51:38] PROBLEM - IPsec on cp4009 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp1052_v4, cp1052_v6 [03:51:38] PROBLEM - IPsec on cp4018 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp1052_v4, cp1052_v6 [03:51:38] PROBLEM - IPsec on cp4010 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp1052_v4, cp1052_v6 [03:51:38] PROBLEM - IPsec on cp4017 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp1052_v4, cp1052_v6 [03:51:39] PROBLEM - IPsec on cp4016 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp1052_v4, cp1052_v6 [03:51:39] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp1052_v4, cp1052_v6 [03:51:39] PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp1052_v4, cp1052_v6 [03:51:39] PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp1052_v4, cp1052_v6 [03:51:40] PROBLEM - IPsec on cp3040 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp1052_v4, cp1052_v6 [03:51:40] PROBLEM - IPsec on cp4008 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp1052_v4, cp1052_v6 [03:51:48] PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp1052_v4, cp1052_v6 [03:51:48] PROBLEM - IPsec on cp3031 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp1052_v4, cp1052_v6 [03:51:58] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 54 connecting: cp1052_v4, cp1052_v6 [03:51:58] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 
54 connecting: cp1052_v4, cp1052_v6 [03:51:58] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 54 connecting: cp1052_v4, cp1052_v6 [03:51:58] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 54 connecting: cp1052_v4, cp1052_v6 [03:51:58] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp1052_v4, cp1052_v6 [03:52:04] zhuyifei1999_: ^^ ? [03:52:08] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 54 connecting: cp1052_v4, cp1052_v6 [03:52:08] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 42 connecting: cp1052_v4, cp1052_v6 [03:52:28] ? [03:58:58] RECOVERY - puppet last run on restbase1008 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [04:00:09] a text varnish went down [04:00:20] no impact, afaict [04:15:58] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2997.90 Read Requests/Sec=3076.50 Write Requests/Sec=6.60 KBytes Read/Sec=32710.40 KBytes_Written/Sec=2970.40 [04:26:58] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=80.00 Read Requests/Sec=202.90 Write Requests/Sec=252.90 KBytes Read/Sec=1868.40 KBytes_Written/Sec=2066.80 [06:46:38] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:57:08] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata] [07:09:38] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:14:38] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [07:15:08] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2201 [07:19:58] PROBLEM - puppet last run on mc1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:20:08] RECOVERY - check_mysql on frdb2001 is OK: Uptime: 1002652 Threads: 1 Questions: 20447294 Slow queries: 5269 Opens: 7939 Flush tables: 1 Open tables: 574 Queries per second avg: 20.393 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [07:25:08] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [07:37:38] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [07:45:28] PROBLEM - puppet last run on elastic1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:46:58] RECOVERY - puppet last run on mc1012 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [08:00:28] PROBLEM - puppet last run on elastic2001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [08:14:28] RECOVERY - puppet last run on elastic1018 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [08:24:28] PROBLEM - puppet last run on mc1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:29:28] PROBLEM - puppet last run on elastic2025 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [08:49:58] PROBLEM - puppet last run on db1068 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:53:28] RECOVERY - puppet last run on mc1006 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [09:02:48] PROBLEM - puppet last run on puppetmaster2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:09:28] !log cleanup logs on elastic20(01|25) - T139043 [09:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:33] T139043: nested RemoteTransportExceptions filled the disk on elastic1036 and elastic1045 during a rolling restart - https://phabricator.wikimedia.org/T139043 [09:10:28] RECOVERY - puppet last run on elastic2001 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [09:11:28] RECOVERY - Disk space on elastic2025 is OK: DISK OK [09:11:28] RECOVERY - puppet last run on elastic2025 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [09:13:18] (PS1) Gehel: elasticsearch - reimage elastic20(33|34|35|36) to jessie and move data to /srv [puppet] - https://gerrit.wikimedia.org/r/337218 (https://phabricator.wikimedia.org/T151326) [09:14:28] (CR) Gehel: [C: 2] elasticsearch - reimage elastic20(33|34|35|36) to jessie and move data to /srv [puppet] - https://gerrit.wikimedia.org/r/337218 (https://phabricator.wikimedia.org/T151326) (owner: Gehel) [09:15:23] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic20(33|34|35|36).codfw.wmnet [09:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:27] Operations, CirrusSearch, Discovery, Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3019090 (ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2034.codfw.wmnet'] ```... 
[09:16:30] Operations, CirrusSearch, Discovery, Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3019091 (ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2033.codfw.wmnet'] ```... [09:16:47] Operations, CirrusSearch, Discovery, Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3019092 (ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2035.codfw.wmnet'] ```... [09:16:51] Operations, CirrusSearch, Discovery, Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3019093 (ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2036.codfw.wmnet'] ```... [09:17:58] RECOVERY - puppet last run on db1068 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [09:19:38] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 5 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [09:30:48] RECOVERY - puppet last run on puppetmaster2001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [09:35:12] !log rebooting mw1236 to make sure that it comes up cleanly - T156610 [09:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:17] T156610: mw1236 powered down and not able to powerup - https://phabricator.wikimedia.org/T156610 [09:37:38] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 4 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [09:40:42] Operations, CirrusSearch, Discovery, Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3019098 
(ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2033.codfw.wmnet'] ``` Of which those **FAILED**: ``` set(['elastic203... [09:42:25] Operations, CirrusSearch, Discovery, Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3019099 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2036.codfw.wmnet'] ``` and were **ALL** successful. [09:52:10] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1236.eqiad.wmnet [09:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:38] Operations, CirrusSearch, Discovery, Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3019103 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2035.codfw.wmnet'] ``` and were **ALL** successful. [09:53:28] Operations, CirrusSearch, Discovery, Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3019104 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2034.codfw.wmnet'] ``` and were **ALL** successful. [09:53:33] !log mw1236 back in production (scap pull executed before pooled=yes) - T156610 [09:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:38] T156610: mw1236 powered down and not able to powerup - https://phabricator.wikimedia.org/T156610 [09:54:11] Operations, ops-eqiad: mw1236 powered down and not able to powerup - https://phabricator.wikimedia.org/T156610#3019108 (elukey) Open>Resolved Thanks @Cmjohnson!! [10:02:20] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic20(33|34|35|36).codfw.wmnet [10:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:00] mw1236 looks good! 
[10:08:05] RECOVERY - Disk space on elastic2001 is OK: DISK OK [12:33:15] PROBLEM - puppet last run on rcs1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:02:15] RECOVERY - puppet last run on rcs1002 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [14:38:25] PROBLEM - puppet last run on db1087 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:55:45] PROBLEM - puppet last run on ms-be1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:06:25] RECOVERY - puppet last run on db1087 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [15:24:45] RECOVERY - puppet last run on ms-be1022 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [15:27:29] hoo hey [15:27:40] hi DatGuy [15:28:13] do you have any idea about https://en.wikipedia.org/w/index.php?limit=100&title=Special%3AContributions&contribs=user&target=5.142.2*&namespace=&tagfilter=&year=2017&month=-1 ? [15:28:20] hi I don't know much about all this bot stuff but there are a number of bot edits logged out right now [15:28:25] https://en.wikipedia.org/wiki/Special:Contributions/10.68.23.103 [15:28:29] figured it was worth mentioning [15:28:38] whoops [15:28:40] ignore my message [15:28:52] Chrissymad's link is what i wanted to send [15:28:53] hm ok [15:29:03] could we add assert to the login? [15:29:17] This happens to AdminStats frequently [15:29:21] the bot author(s) can do that [15:29:51] if you deem this a problem, you can (soft!!!) block the ip with a notice [15:29:53] Chrissymad: That's no good. I'd be tempted to block it [15:30:20] I guess we can even globally block [15:30:35] also a problem on commons, it seems (although a minor one, one edit only) [15:30:35] PROBLEM - puppet last run on labsdb1007 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:30:54] I blocked it for 48 hours on enwiki [15:31:00] cool [15:31:09] https://commons.wikimedia.org/w/index.php?title=Commons:Database_reports/Unusually_long_user_blocks&diff=prev&oldid=225530941 [15:31:19] maybe a year globally would also be a good idea? [15:31:28] Does anyone ever need to do things as IP from that IP? [15:31:33] (Like signing up, …) [15:31:44] You'd presume not [15:32:08] I'll email JamesR [15:32:48] It's supposed to be https://en.wikipedia.org/wiki/User:AdminStatsBot [15:32:59] Globally blocked for a year [15:33:01] Is there a way for you guys to tell if an IP edit (or rather a series of IP edits) are actually a bot or is that something a CU would need to do? Cause I have some suspicions about an LTA issue. [15:33:34] Reedy, that ip I linked earlier also seems to be BernsteinBot https://en.wikipedia.org/w/index.php?title=Wikipedia:Database_reports/Largely_duplicative_file_names&diff=prev&oldid=764895275 [15:33:47] Chrissymad: It's likely many bots editing [15:35:24] hoo: just block 10.0.0.0/8 ;P [15:36:28] Like that? https://qph.ec.quoracdn.net/main-qimg-ef279af035810c5317e01b5f24b8b8b9-c [15:39:24] https://en.wikipedia.org/wiki/Special:Contributions/5.142.204.95 also may want to take a look at that I just rolled back their edits to datbot [15:40:35] PROBLEM - puppet last run on db1048 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[pt-heartbeat-kill] [15:42:15] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [15:46:55] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 3.085 second response time [15:58:35] RECOVERY - puppet last run on labsdb1007 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [16:08:35] RECOVERY - puppet last run on db1048 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [16:14:05] Chrissymad: Who cares? [16:14:33] hoo, Reedy: I guess you all haven't been following this part of Gerrit. [16:14:52] hoo, Reedy, Chrissymad: https://gerrit.wikimedia.org/r/#/c/324215/ [16:15:14] hoo: Blocking the IP address range doesn't actually solve whatever issue is logging these bots out. [16:15:45] Or invalidating their sessions or whatever. [16:21:02] Yvette: it does, in some sense. Instead of editing anonymously, the bot will now crash (or not work), and the issue will end up back at the maintainer [16:21:23] Yvette: and it prevents a much larger issue: sysops blocking the IP //including// edits from logged-in accounts [16:22:40] valhallasw`cloud: So because of some issue with user sessions, editors get punished? [16:22:49] The bots don't make an edit instead of making an edit that "unattributed." [16:22:58] that's * [16:23:13] I'd personally rather have the edits. And maybe someone can fix whatever session issue is clearly happening. [16:23:23] Saying you want to stop dumb admins is a lame excuse. [16:25:49] Sometimes bots don't edit because of bugs in their code. That's just how it goes. [16:26:23] And no, dumb admins are not a lame excuse. This has happened, and it of course caused larger scale issues. [16:26:52] valhallasw`cloud: Bugs in their code? These scripts have mostly been running fine for years. [16:27:02] You really think it's the scripts to blame? 
[16:27:18] And it's a terrible excuse to say "well a dumb admin could make a dumb block, so let's make the dumb block first." [16:27:30] Shrug. [16:28:07] Well, they should be checking if they're logged in [16:28:11] To what end? [16:28:21] Yvette: The ecosystem changes and sometimes bots have to be adapted. Then again, assert=user isn't exactly a new feature in the API, and logging in again when that assert fails is a pretty basic feature for a bot. [16:28:24] The bots are logging in. Somehow they're losing their session. [16:29:01] valhallasw`cloud: The ecosystem can change all it wants, but until someone can point to the script I'm using as the culprit, I'll assume the script I haven't touched in years isn't to blame. [16:29:24] It's way more likely, IMO, that Wikimedia did something stupid with session handling here. [16:29:26] But if you haven't updated it to take account of changes in the MW api etc [16:29:30] What changes? [16:29:37] "years" is vague [16:29:40] Many things have changed [16:30:02] https://en.wikipedia.org/w/index.php?title=Wikipedia:Database_reports/Largely_duplicative_file_names&action=history [16:30:04] Login processes have changed [16:30:06] session cookies are no longer passed in the api response, for example (and just as actual cookies) [16:30:08] You can easily look at that page history. [16:30:13] And see that the bot logs in. [16:30:19] session lifetime may have changed [16:30:26] It was working yesterday. [16:30:28] It intermittently fails. [16:30:35] Obviously I'm not changing anything. [16:31:02] That script has been running since June 2011. [16:31:49] Anyway, I'm not adding assert logic to fix this as I'm fine with the bot editing logged out. [16:33:04] I might be willing to help diagnose whatever dumb issue is happening. [16:33:58] https://en.wikipedia.org/wiki/Special:Contributions/10.68.23.223 <-- Which more likely: something in my scripts changed on November 4 or server-side session handling got screwed up somehow? 
[16:34:04] I'm hungry. [16:34:30] I'd presume a deploy happened [16:34:57] Hmmm. That's an interesting theory. That shouldn't invalidate most user sessions, though. [16:34:59] Yvette: I'm assuming that your cookie was no longer valid, and your bot ignored this instead of logging in again [16:35:08] I mean, if it did, presumably users would complain a lot more often. [16:35:23] valhallasw`cloud: Why would my cookie be invalid? [16:35:26] What if all sessions were invalidated for a specific reason? [16:35:35] Then you'd annoy everyone a lot? [16:35:35] You'll just carry on editing logged out because you don't care? [16:35:39] Yeah [16:35:41] Sure, why not? [16:35:44] We allow anonymous editing. [16:35:47] "Anonymous." [16:35:51] Which is why it's not done purposefully without good reason [16:35:55] It's more annoying to not have the edits, surely. [16:36:03] Depends on the edits [16:36:09] Many don't add any value [16:36:11] Like nobody is looking at that page and going "gosh, I really wish I knew who wrote this!" [16:36:17] Many edits? [16:36:22] Talk to the editors, then. [16:37:10] valhallasw`cloud: If it fails once every three months or whatever, I'm definitely not fixing it. [16:37:19] Patches welcome, tho. [16:37:35] then the bot will not edit every three months or whatever ¯\_(ツ)_/¯ [16:37:44] So how come it works after that? [16:37:48] something else obviously changes [16:37:50] After what? [16:37:56] the bot probably logs in again? [16:37:56] The next edit is logged in [16:38:03] Oh, it logs in every time. [16:38:12] It's a daily script. [16:38:15] https://en.wikipedia.org/w/index.php?title=Wikipedia:Database_reports/Largely_duplicative_file_names/Configuration [16:38:16] But then doesn't check if the login actually worked? [16:38:21] Check how? [16:38:25] assert=user [16:38:27] It logs in and edits the page every day. [16:38:33] The bot logs in and edits the page every day. [16:38:43] On some days, intermittently, it loses its session (we think). 
[16:38:45] Or maybe the login failed. [16:38:49] Probably the session, though. [16:38:52] https://en.wikipedia.org/w/api.php?action=query&meta=userinfo will tell you what the API thinks you are [16:38:58] I don't care, though. [16:39:05] I understand how the assertion code works. [16:39:18] I'm saying that this script works 88 out of 90 times. [16:39:20] gj [16:39:25] And the two times it fails are probably not its fault. [16:39:33] Since I'm not changing the script or touching any of its code. [16:39:45] Meanwhile, on the other side, people are constantly changing the code. [16:40:02] Yvette: so how does this script work? log in, do query, then edit the page with the result? [16:40:09] Yeah. [16:40:11] https://en.wikipedia.org/w/index.php?title=Wikipedia:Database_reports/Largely_duplicative_file_names/Configuration [16:40:15] It's a pretty simple script. [16:40:28] For long-running queries, I think I've sometimes moved the login code below the query execution. [16:40:29] Interestingly, this assert edit has been around at least since Nov 2007 [16:40:29] move the login to just before the edit? [16:40:39] Yeah, I could. [16:40:47] talking about simple solutions [16:40:48] But again, don't want to at all. [16:40:49] So this problem has no doubt been solved since long before the script was written [16:40:55] valhallasw`cloud: Or someone could diagnose the actual issue? [16:41:02] Reedy: What problem? [16:41:10] Finding out if you're logged in [16:41:11] You both are looking at solutions instead of the actual problem. [16:41:16] That's not the problem, bro. [16:41:20] Have you reported it? [16:41:22] Yvette: you are assuming cookie invalidation is a problem. It's not. [16:41:27] Have you provided detailed debugging details? [16:41:29] valhallasw`cloud: Why not? [16:41:34] Cookies, session info etc you think should be active? [16:41:59] Reedy: No. I'm not sure how many ways I can say that I don't care about the occasional logged-out edit [16:42:02] . 
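The check being debated in the exchange above (meta=userinfo to see who the API thinks you are, and assert=user so a dead session fails loudly instead of saving an anonymous edit) can be sketched roughly as follows. This is a minimal sketch, not the bot's actual code: the function names are hypothetical, and only the response/parameter shapes follow the MediaWiki Action API.

```python
# Hedged sketch: decide whether a MediaWiki API session is still logged in,
# and build edit parameters that refuse to save anonymously.
# Helper names here are illustrative, not from the bot under discussion.

def is_logged_in(userinfo_response):
    """Inspect an action=query&meta=userinfo response.

    An anonymous session comes back with an 'anon' flag and user id 0,
    so a named, logged-in account has neither."""
    ui = userinfo_response.get("query", {}).get("userinfo", {})
    return "anon" not in ui and ui.get("id", 0) != 0

def edit_params(title, text, token):
    """Build action=edit parameters.

    With assert=user the server rejects the edit with an
    'assertuserfailed' error instead of attributing it to the IP."""
    return {
        "action": "edit",
        "title": title,
        "text": text,
        "token": token,
        "assert": "user",
        "format": "json",
    }
```

A bot would call `is_logged_in` on the userinfo response right before editing, and treat `False` (or an `assertuserfailed` error on the edit itself) as "log in again", which is the behaviour the channel is asking for.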
[16:42:06] Other people do [16:42:08] Who? [16:42:12] Yvette: because sessions can be invalidated for many different reasons. They are an edge case a bot should take care of. [16:42:27] valhallasw`cloud: Since when? [16:42:34] Does your user session get invalidated in a browser? [16:42:39] You presumably could've prevented the logged out edits from the bot/client side in less time than this discussion has been going on [16:42:40] Like when does that ever happen? [16:42:44] Every 30 days [16:42:47] No. [16:42:49] That's not true. [16:43:06] We set the cookie for 365 days on Wikimedia wikis. And this script is logging in every day. [16:43:10] $wgCookieExpiration = 30 * 86400; [16:43:10] $wgExtendedLoginCookieExpiration = 365 * 86400; [16:43:21] Well, if it's logging in and it's not logged in, something is going wrong [16:43:28] No kidding. [16:43:29] Maybe your code. Maybe not [16:43:33] Maybe someone should investigate that. [16:43:38] It's not just this script, BTW. [16:43:39] File a bug then? [16:43:44] Report it [16:43:49] Give detailed information [16:43:58] Tell when you logged in, when the edits were made logged out [16:44:00] Yvette: you're right, it's a year. Yet I log in far more often. [16:44:06] So first I should go edit all of these scripts. [16:44:09] Rather than expecting someone to grep many logs [16:44:13] Now I should go file a detailed ticket for you. [16:44:19] Any other free work you'd like from me? [16:44:22] On a Saturday. [16:44:26] Just wondering. [16:44:27] You've got nothing better to do, right? [16:44:31] I need lunch! [16:44:32] It's my birthday for christ sakes [16:44:36] And I'm arguing about inane shit with you [16:44:38] Oh, happy birthday! [16:44:50] * Yvette hugs. [16:44:58] * Chrissymad gives Reedy a beer [16:45:23] Reedy: Might also be Wikimedia Labs. [16:45:29] Since that's a common factor here, I think. [16:45:31] Yvette: Wouldn't have happened on toolserver. [16:45:35] ikr [16:45:43] The Toolserver had a soft-block, heh.
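(Editor's note: the "move the login to just before the edit" suggestion above can be sketched as a small retry wrapper. The idea is that the edit request carries `assert=user`, so a lost session surfaces as an `assertuserfailed` API error rather than a silent logged-out edit; the bot then re-logs-in and retries. The `edit` and `login` callables are placeholders for whatever API client the bot actually uses.)

```python
class AssertUserFailed(Exception):
    """Raised when the API rejects a request with code=assertuserfailed."""

def edit_with_relogin(edit, login, retries: int = 1):
    """Attempt an edit; if the session was lost, log in again and retry.

    `edit` is assumed to send assert=user and raise AssertUserFailed
    when the server says we are no longer logged in.
    """
    for attempt in range(retries + 1):
        try:
            return edit()
        except AssertUserFailed:
            if attempt == retries:
                raise          # still failing after re-login: give up loudly
            login()            # refresh the session, then retry the edit
```

This keeps the daily script's structure intact (log in, run query, edit) while handling the rare invalidated-session case the discussion is about.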
[16:45:52] At least on the English Wikipedia. [16:46:15] Ok, then my argument magically changes to 'This is how the Toolserver did it'. Done! :P [16:46:44] https://gerrit.wikimedia.org/r/#/c/324215/ is the relevant changeset. [16:46:48] I already said my piece there. [16:47:05] I'm not really opposed to soft-blocking, especially as there's precedent, but it still doesn't solve whatever the actual issue is. [16:47:10] And probably just masks it deeper. [16:47:36] Well, pissing off people so they look into it more thoroughly seems like a way forward [16:47:39] Rather than not fixing it at all [16:48:16] Lawl. That's, uhh, quite an approach. [16:48:55] If it happened regularly, I'd be more inclined to diagnose. But it happens like once a month or so. [17:36:45] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:04:45] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [18:59:35] PROBLEM - puppet last run on db1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:26:49] (03PS1) 10Brion VIBBER: Bump up number of queue runners for transcodes [puppet] - 10https://gerrit.wikimedia.org/r/337230 (https://phabricator.wikimedia.org/T108234) [19:28:35] RECOVERY - puppet last run on db1049 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [20:37:15] PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:37:35] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:05:35] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [21:06:15] RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [22:35:25] PROBLEM - puppet last run on maps1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:41:25] PROBLEM - puppet last run on ms-be1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:03:25] RECOVERY - puppet last run on maps1003 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [23:09:25] RECOVERY - puppet last run on ms-be1016 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [23:50:13] (03CR) 10Reedy: [C: 031] Fix SiteConfiguration array merge syntax [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336747 (https://phabricator.wikimedia.org/T157656) (owner: 10Gergő Tisza) [23:57:25] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues