[00:00:24] PROBLEM - SSH cp3049.mgmt.esams.wmnet on cp3049.mgmt.esams.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:04:26] (03PS1) 10Dzahn: rancid: disable fully automatic rsyncing of app data [puppet] - 10https://gerrit.wikimedia.org/r/364624 (https://phabricator.wikimedia.org/T166180)
[00:05:52] (03CR) 10Dzahn: [C: 032] rancid: disable fully automatic rsyncing of app data [puppet] - 10https://gerrit.wikimedia.org/r/364624 (https://phabricator.wikimedia.org/T166180) (owner: 10Dzahn)
[00:16:39] (03PS1) 10Dzahn: rancid: use cron{} resource instead of file in /etc/cron.d/ [puppet] - 10https://gerrit.wikimedia.org/r/364625
[00:20:55] (03CR) 10Dzahn: [C: 032] rancid: use cron{} resource instead of file in /etc/cron.d/ [puppet] - 10https://gerrit.wikimedia.org/r/364625 (owner: 10Dzahn)
[00:24:39] (03CR) 10Dzahn: "before:" [puppet] - 10https://gerrit.wikimedia.org/r/364625 (owner: 10Dzahn)
[00:47:10] (03PS1) 10Dzahn: rancid/netmon: add active_server parameter to DC-switch [puppet] - 10https://gerrit.wikimedia.org/r/364629
[00:48:02] (03CR) 10jerkins-bot: [V: 04-1] rancid/netmon: add active_server parameter to DC-switch [puppet] - 10https://gerrit.wikimedia.org/r/364629 (owner: 10Dzahn)
[00:49:40] (03PS1) 10BBlack: puppet-run: manual splay at the top [puppet] - 10https://gerrit.wikimedia.org/r/364630
[00:53:05] 10Operations: Remove mbrar@wikimedia.org from legal-tm-vio@wikimedia.org - https://phabricator.wikimedia.org/T170361#3428712 (10Reedy)
[00:53:28] (03PS2) 10Dzahn: rancid/netmon: add active_server parameter to DC-switch [puppet] - 10https://gerrit.wikimedia.org/r/364629
[00:53:38] 10Operations: Remove mbrar@wikimedia.org from legal-tm-vio@wikimedia.org - https://phabricator.wikimedia.org/T170361#3428713 (10Dzahn) a:03Dzahn
[00:57:01] 10Operations: Remove mbrar@wikimedia.org from legal-tm-vio@wikimedia.org - https://phabricator.wikimedia.org/T170361#3428721 (10Dzahn) Hi Brendan. Done! Here's the diff: ``` -legal-tm-vio: slaporte, rstallman, mbrar, jrogers, kfrancis, croslof, trademark +legal-tm-vio: slaporte, rstallman, jrogers, kfranci...
[00:57:23] 10Operations: Remove mbrar@wikimedia.org from legal-tm-vio@wikimedia.org - https://phabricator.wikimedia.org/T170361#3428722 (10Dzahn) 05Open>03Resolved
[01:00:14] RECOVERY - SSH cp3049.mgmt.esams.wmnet on cp3049.mgmt.esams.wmnet is OK: SSH OK - OpenSSH_5.8 (protocol 2.0)
[01:01:58] 10Operations: Remove mbrar@wikimedia.org from legal-tm-vio@wikimedia.org - https://phabricator.wikimedia.org/T170361#3428732 (10Dzahn) @bcampell @bbogaert We would also be happy to move this alias over to you guys at any time. Just like the other ones we moved as part of T122144. There have been many [1] reques...
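[Editor's note on the [00:49:40] patch "puppet-run: manual splay at the top": a splay staggers cron-driven agent runs so hosts sharing the same cron minute don't hit the puppetmaster simultaneously. A minimal sketch of the idea in shell; the variable name and the 60-second window are illustrative assumptions, not the contents of r/364630.]

```
#!/bin/bash
# Sketch of a manual splay at the top of a cron-driven script such as
# puppet-run (illustrative only; not the actual patch in r/364630).
# SPLAY_SECONDS is a hypothetical knob.
SPLAY_SECONDS=60

# Sleep a pseudo-random 0..SPLAY_SECONDS-1 seconds before doing real
# work, spreading load across hosts that fire on the same cron minute.
sleep $(( RANDOM % SPLAY_SECONDS ))

# ... rest of the script (the actual puppet agent run, etc.) ...
```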
[01:05:25] 10Operations, 10Mail: move legal-tm-vio alias to OIT - https://phabricator.wikimedia.org/T170365#3428748 (10Dzahn)
[01:05:56] 10Operations, 10Mail, 10Office-IT, 10WMF-Legal: move legal-tm-vio alias to OIT - https://phabricator.wikimedia.org/T170365#3428748 (10Dzahn)
[01:06:51] 10Operations, 10Mail, 10Office-IT, 10WMF-Legal: move legal-tm-vio alias to OIT - https://phabricator.wikimedia.org/T170365#3428748 (10Dzahn)
[01:09:13] (03CR) 10Dzahn: "thanks for the icinga fix :)" [puppet] - 10https://gerrit.wikimedia.org/r/364174 (https://phabricator.wikimedia.org/T169070) (owner: 10Muehlenhoff)
[01:10:20] (03CR) 10Dzahn: "thanks for this, had similar intention and you already did it:)" [puppet] - 10https://gerrit.wikimedia.org/r/364373 (owner: 10Muehlenhoff)
[01:16:45] (03PS3) 10Dzahn: rancid/netmon: add active_server parameter to DC-switch [puppet] - 10https://gerrit.wikimedia.org/r/364629
[01:17:33] 10Operations, 10RESTBase, 10RESTBase-API, 10Traffic, 10Services (next): RESTBase support for www.wikimedia.org missing - https://phabricator.wikimedia.org/T133178#3428785 (10Krinkle) @GWicke Regarding T138848, note that there are two separate problems imho. I don't mind them being solved at the same time...
[01:18:22] 10Operations, 10Mail, 10Office-IT, 10WMF-Legal: move legal-tm-vio alias to OIT - https://phabricator.wikimedia.org/T170365#3428787 (10Dzahn)
[01:18:31] 10Operations: Remove mbrar@wikimedia.org from legal-tm-vio@wikimedia.org - https://phabricator.wikimedia.org/T170361#3428453 (10Dzahn) I made T170365 for this.
[01:19:39] (03PS4) 10Dzahn: rancid/netmon: add active_server parameter to DC-switch [puppet] - 10https://gerrit.wikimedia.org/r/364629 (https://phabricator.wikimedia.org/T166180)
[01:31:26] 10Operations, 10RESTBase, 10RESTBase-API, 10Traffic, 10Services (next): RESTBase support for www.wikimedia.org missing - https://phabricator.wikimedia.org/T133178#3428800 (10GWicke) @krinkle: Agreed that there are some subtle differences between the tasks. I mainly merged them since the discussion here h...
[01:38:03] 10Operations, 10RESTBase, 10RESTBase-API, 10Traffic, 10Services (next): RESTBase support for www.wikimedia.org missing - https://phabricator.wikimedia.org/T133178#3428811 (10Krinkle) >>! In T133178#3345748, @GWicke wrote: > To summarize the options using a single domain only: > > ## Use www.wikimedia.or...
[01:47:06] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/7016/" [puppet] - 10https://gerrit.wikimedia.org/r/364629 (https://phabricator.wikimedia.org/T166180) (owner: 10Dzahn)
[01:58:02] (03CR) 10Dzahn: "s[netmon1002:~] $ sudo crontab -u rancid -l" [puppet] - 10https://gerrit.wikimedia.org/r/364629 (https://phabricator.wikimedia.org/T166180) (owner: 10Dzahn)
[02:07:24] (03PS1) 10Dzahn: rsync::quickdatacopy: add 'sync-' prefix to /usr/local/sbin/ file [puppet] - 10https://gerrit.wikimedia.org/r/364636
[02:11:07] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-3/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave]BR
[02:12:54] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[02:14:54] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 39 probes of 435 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[02:17:29] (03PS2) 10Dzahn: rsync::quickdatacopy: add 'sync-' prefix to /usr/local/sbin/ file [puppet] - 10https://gerrit.wikimedia.org/r/364636
[02:19:27] (03PS3) 10Dzahn: rsync::quickdatacopy: add 'sync-' prefix to /usr/local/sbin/ file [puppet] - 10https://gerrit.wikimedia.org/r/364636
[02:19:55] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 1 probes of 435 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[02:20:24] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0]
[02:21:08] (03CR) 10Dzahn: [C: 032] rsync::quickdatacopy: add 'sync-' prefix to /usr/local/sbin/ file [puppet] - 10https://gerrit.wikimedia.org/r/364636 (owner: 10Dzahn)
[02:23:24] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [3000.0]
[02:23:26] (03CR) 10Dzahn: "deleted old files manually on releases1001 and netmon2001" [puppet] - 10https://gerrit.wikimedia.org/r/364636 (owner: 10Dzahn)
[02:24:24] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0]
[02:27:24] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [3000.0]
[02:29:53] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.7) (duration: 09m 25s)
[02:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:57:28] (03PS1) 10Dzahn: rsync::quickdatacopy: copy data via IPv4, don't rely on IPv6 setup [puppet] - 10https://gerrit.wikimedia.org/r/364640
[03:03:48] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.9) (duration: 13m 57s)
[03:03:53] (03PS1) 10Dzahn: add IPv6 records for netmon2001 [dns] - 10https://gerrit.wikimedia.org/r/364641 (https://phabricator.wikimedia.org/T166180)
[03:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:09:49] (03PS1) 10Dzahn: rsync::quickdatacopy: add ferm rule also for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/364642
[03:10:42] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Jul 12 03:10:42 UTC 2017 (duration 6m 54s)
[03:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:40:38] (03PS2) 10Dzahn: add IPv6 records for netmon2001 [dns] - 10https://gerrit.wikimedia.org/r/364641 (https://phabricator.wikimedia.org/T166180)
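[Editor's note on the netmon2001 AAAA records added in r/364641 ([03:03:53]/[03:40:38] above; the "host netmon2001.wikimedia.org" check follows at [03:43:30]): once the zone change is deployed, the records can be verified from any host with standard DNS tools. A sketch; the expected addresses are not shown because they don't appear in the log, and $ADDR is a placeholder.]

```
# Verify the new AAAA records once the DNS change is live.
host netmon2001.wikimedia.org

# Or query the AAAA record explicitly:
dig +short AAAA netmon2001.wikimedia.org

# Optionally confirm the reverse (PTR) record matches, substituting
# the address returned above ($ADDR is a placeholder):
dig +short -x "$ADDR"
```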
[03:41:16] (03CR) 10Dzahn: [C: 032] add IPv6 records for netmon2001 [dns] - 10https://gerrit.wikimedia.org/r/364641 (https://phabricator.wikimedia.org/T166180) (owner: 10Dzahn)
[03:43:30] (03CR) 10Dzahn: "host netmon2001.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/364641 (https://phabricator.wikimedia.org/T166180) (owner: 10Dzahn)
[03:44:56] (03PS2) 10Dzahn: rsync::quickdatacopy: copy data via IPv4, don't rely on IPv6 setup [puppet] - 10https://gerrit.wikimedia.org/r/364640
[03:45:57] (03PS2) 10Dzahn: rsync::quickdatacopy: add ferm rule also for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/364642
[03:47:22] (03PS3) 10Dzahn: rsync::quickdatacopy: add ferm rule also for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/364642
[03:49:21] (03CR) 10Dzahn: [C: 032] rsync::quickdatacopy: add ferm rule also for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/364642 (owner: 10Dzahn)
[03:51:22] (03PS4) 10Dzahn: rsync::quickdatacopy: add ferm rule also for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/364642
[03:52:57] (03PS5) 10Dzahn: rsync::quickdatacopy: add ferm rule also for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/364642
[03:58:54] 10Operations, 10Mail, 10Office-IT, 10WMF-Legal: move legal-tm-vio alias to OIT - https://phabricator.wikimedia.org/T170365#3428748 (10Slaporte) Sounds good to me.
[04:13:12] (03CR) 10Dzahn: "this fixed the rsync without having to do https://gerrit.wikimedia.org/r/#/c/364640/" [puppet] - 10https://gerrit.wikimedia.org/r/364642 (owner: 10Dzahn)
[04:13:41] (03Abandoned) 10Dzahn: rsync::quickdatacopy: copy data via IPv4, don't rely on IPv6 setup [puppet] - 10https://gerrit.wikimedia.org/r/364640 (owner: 10Dzahn)
[04:14:22] (03PS2) 10Dzahn: switch librenms from netmon1001 to netmon1002 [dns] - 10https://gerrit.wikimedia.org/r/364617 (https://phabricator.wikimedia.org/T159756)
[04:15:58] (03PS3) 10Dzahn: admin: Remove deployers from restricted group [puppet] - 10https://gerrit.wikimedia.org/r/364469 (https://phabricator.wikimedia.org/T104671) (owner: 10Dereckson)
[04:16:07] (03PS4) 10Dzahn: admin: Remove deployers from restricted group [puppet] - 10https://gerrit.wikimedia.org/r/364469 (https://phabricator.wikimedia.org/T104671) (owner: 10Dereckson)
[04:20:09] (03CR) 10Dzahn: icinga/role:mail::mx: add monitoring of exim queue size (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) (owner: 10Dzahn)
[04:20:28] (03PS7) 10Dzahn: icinga/role:mail::mx: add monitoring of exim queue size [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110)
[04:21:45] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0]
[04:23:07] (03PS8) 10Dzahn: icinga/role:mail::mx: add monitoring of exim queue size [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110)
[04:25:21] (03CR) 10Dzahn: "fixed commit message per Filippo's comments. changed 4h to 10m per Alexandros' comments." [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) (owner: 10Dzahn)
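[Editor's note on the [04:13:12] comment that the IPv6 ferm rule (r/364642) fixed the rsync on its own: with AAAA records in place the copy now prefers IPv6, so the firewall must accept rsync (873/tcp) on both address families. A quick, hedged way to confirm reachability over each family from the source host; the module name `data` is a placeholder, not the real quickdatacopy module.]

```
# Check that the rsync daemon (873/tcp) answers over each family.
# netmon2001.wikimedia.org is the destination named in the log;
# the module name "data" is a placeholder.
rsync -4 rsync://netmon2001.wikimedia.org/data   # force IPv4
rsync -6 rsync://netmon2001.wikimedia.org/data   # force IPv6

# Listing the bare daemon shows the exported modules without copying:
rsync -6 rsync://netmon2001.wikimedia.org/
```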
[04:25:54] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [3000.0]
[04:29:20] 10Operations, 10monitoring, 10Patch-For-Review: rack/setup/install netmon2001 - https://phabricator.wikimedia.org/T166180#3428934 (10Dzahn)
[04:30:04] 10Operations, 10monitoring, 10Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3207696 (10Dzahn)
[05:20:35] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0]
[05:22:44] RECOVERY - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0]
[05:23:44] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [3000.0]
[05:31:27] 10Operations, 10Traffic, 10netops: Recurring varnish-be fetch failures in codfw - https://phabricator.wikimedia.org/T170131#3429009 (10ayounsi) The BFD timer change didn't improve anything. Got emails from Zayo saying they fixed an issue on the the eqiad-codfw link (finished at around 02:14Z today). Will mon...
[05:35:54] PROBLEM - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [3000.0]
[05:41:44] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:41:46] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:41:46] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:41:46] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:41:46] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:41:46] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:41:46] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:41:54] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:41:55] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:41:55] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:42:04] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:42:05] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:42:09] backups \o/
[05:42:11] I will silence that
[05:42:14] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:42:14] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:42:14] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:42:15] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:42:24] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:42:34] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:42:34] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:42:34] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:42:44] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[05:42:45] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:42:45] RECOVERY - MariaDB Slave IO: s3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:42:45] RECOVERY - MariaDB Slave SQL: s5 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[05:42:54] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[05:42:54] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[05:42:55] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:43:04] RECOVERY - MariaDB Slave IO: s6 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:43:04] RECOVERY - MariaDB Slave IO: s1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:43:04] RECOVERY - MariaDB Slave IO: x1 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:43:05] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[05:43:14] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:43:24] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[05:43:24] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave
[05:43:34] RECOVERY - MariaDB Slave IO: s4 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:43:34] RECOVERY - MariaDB Slave IO: m2 on dbstore1001 is OK: OK slave_io_state not a slave
[05:43:35] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[05:43:35] RECOVERY - MariaDB Slave SQL: x1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[05:43:35] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[05:43:44] RECOVERY - MariaDB Slave SQL: s7 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[05:49:39] 10Puppet, 10Cloud-VPS: role::puppetmaster::standalone has no firewall rule for port 8140 - https://phabricator.wikimedia.org/T154150#3429032 (10bd808)
[05:55:24] RECOVERY - MariaDB Slave SQL: x1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[05:58:14] RECOVERY - MariaDB Slave Lag: x1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.39 seconds
[05:59:09] 10Puppet, 10Cloud-Services, 10Continuous-Integration-Infrastructure, 10Beta-Cluster-reproducible: New instances attached to a role::puppetmaster::standalone Puppetmaster need manual changes after switching from the default Puppetmaster - https://phabricator.wikimedia.org/T148929#3429039 (10bd808)
[06:03:14] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=2461.00 Read Requests/Sec=512.30 Write Requests/Sec=2.40 KBytes Read/Sec=50319.60 KBytes_Written/Sec=49.60
[06:04:06] 10Operations, 10Puppet, 10Cloud-Services: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#3429048 (10bd808) 05Open>03Resolved WP:BOLD'ly closing this stale task. The LDAP enc is long gone now.
[06:07:51] 10Operations, 10netops: Remove unsecure SSH algorithms on network devices - https://phabricator.wikimedia.org/T170369#3429050 (10ayounsi)
[06:11:22] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=1.00 Read Requests/Sec=346.10 Write Requests/Sec=1.50 KBytes Read/Sec=1386.80 KBytes_Written/Sec=35.20
[06:12:17] 10Puppet, 10Cloud-Services: Make changing puppetmasters for Labs instances more easy - https://phabricator.wikimedia.org/T152941#2864105 (10bd808) p:05Lowest>03Low == Workaround == See https://wikitech.wikimedia.org/wiki/Help:Standalone_puppetmaster: Agent: ``` $ sudo -i puppet agent -tv $ sudo rm -fR /va...
[06:18:52] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3429080 (10Marostegui)
[06:26:57] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3429081 (10Marostegui)
[06:27:00] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3156297 (10Marostegui) Thanks @jcrespo and @Cmjohnson for advancing a lot on this task! The only pending host now to be able to resolve this task is db1106 which looks like it doesn't have...
[06:27:03] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364657
[06:27:06] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364657
[06:27:41] 10Puppet, 10Cloud-VPS: Invesitgate use of Puppet "modules" for per-project Puppet manifests - https://phabricator.wikimedia.org/T170370#3429083 (10bd808)
[06:31:59] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364657 (owner: 10Marostegui)
[06:32:53] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364657 (owner: 10Marostegui)
[06:33:07] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364657 (owner: 10Marostegui)
[06:34:14] What is going on on tin:/srv/mediawiki-staging ? : https://phabricator.wikimedia.org/P5727
[06:34:38] no idea
[06:34:43] ping releng
[06:35:03] yeah
[06:38:03] <_joe_> I'll take a look as well
[06:38:16] Thanks
[06:38:24] I have pinged releng on their channel too
[06:39:27] (03PS1) 10Marostegui: sanitarium3.my.cnf: Save binlogs 30 days [puppet] - 10https://gerrit.wikimedia.org/r/364659 (https://phabricator.wikimedia.org/T153743)
[06:41:36] <_joe_> marostegui: phone them
[06:41:45] <_joe_> this is totally unacceptable
[06:45:05] <_joe_> so, I think I start to understand what happened here
[06:45:10] <_joe_> my proposal is:
[06:45:24] <_joe_> 1) I'll save what's in master on tin on a "wtf" branch
[06:45:41] <_joe_> 2) I'll reset master on tin to what we have in gerrit
[06:45:49] <_joe_> (that includes manuel's patch)
[06:46:16] <_joe_> oh and 0 - we're in detached head now, so I'll save the state in a wtf-live branch
[06:46:21] I am going to text hashar to see if he is around
[06:46:27] <_joe_> so we can all verify it's ok
[06:48:10] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 898.08 seconds
[06:48:24] ^ expected
[06:48:26] I will silence
[06:49:10] <_joe_> !log saved the current state of mediawiki-staging (in detached head) in the branch "wtf-live"; saved what is in master on tin in "wtf-master"; reset master to the latest commit in origin/master
[06:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:50] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master).
[06:52:43] <_joe_> this ^^ should be fixed now
[06:52:59] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[06:53:17] <_joe_> jynus, marostegui can you take a look and confirm you agree with what I did?
[06:53:22] i am checking
[06:53:46] <_joe_> 'wtf-live' is what was actually live on mediawiki-config
[06:53:55] <_joe_> Dereckson: around by any chance?
[06:55:32] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "Manually cleaning icinga isn't enough." [puppet] - 10https://gerrit.wikimedia.org/r/364504 (owner: 10Dzahn)
[06:56:08] _joe_: I think it looks fine,
[06:56:19] hi
[06:56:57] <_joe_> legoktm: oh I didn't know you were in europe now
[06:57:04] <_joe_> hi :)
[06:58:02] the repo looks fine to me now
[06:58:07] there shouldn't be merge commits in general
[06:58:12] <_joe_> yeah, yeah
[06:58:25] <_joe_> legoktm: it wasn't just a merge commit, I think I got what happened
[06:58:35] <_joe_> tl;dr always do git pull --rebase
[06:59:03] well you're not supposed to git pull according to the instructions
[07:00:06] https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Step_2:_get_the_code_on_the_deployment_host git fetch origin; git log HEAD...origin/master; git rebase origin/master
[07:00:25] <_joe_> legoktm: well pull is the same thing of fetching if you're in a fast-forwardable state
[07:01:13] right. the point of doing it in three commands is so that you can inspect what you're going to rebase onto master after its been pulled but before it is in the main branch
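[Editor's note on the deploy-host convention linked at [07:00:06]: fetch first, inspect, then rebase, rather than a blind `git pull`. Spelled out as it would be run on tin; the commands are the ones quoted in the log, and the inspection step is exactly the point legoktm makes at [07:01:13].]

```
cd /srv/mediawiki-staging

# 1. Fetch without touching the working tree or the local branch.
git fetch origin

# 2. Inspect what is about to land: commits on either side of the
#    symmetric difference between local HEAD and origin/master.
git log HEAD...origin/master

# 3. Only after reviewing, replay any local work on top of
#    origin/master (no merge commits).
git rebase origin/master
```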
[07:02:02] (03PS2) 10Muehlenhoff: Restrict ores::web to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/364185
[07:04:21] <_joe_> legoktm: yeah i know :P
[07:04:26] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1064 - T168661 (duration: 00m 44s)
[07:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:38] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[07:10:40] :)
[07:13:44] (03PS1) 10Ayounsi: Depool codfw for asw-b-codfw upgrade. [dns] - 10https://gerrit.wikimedia.org/r/364661
[07:17:46] (03CR) 10Marostegui: [C: 032] sanitarium3.my.cnf: Save binlogs 30 days [puppet] - 10https://gerrit.wikimedia.org/r/364659 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui)
[07:19:29] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 196.07 seconds
[07:22:30] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0]
[07:22:39] RECOVERY - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0]
[07:23:56] (03PS1) 10Ayounsi: Route traffic around codfw for asw-b-codfw upgrade. [puppet] - 10https://gerrit.wikimedia.org/r/364663
[07:27:44] !log Drop table localisation_file_hash from enwiki - T119811
[07:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:54] T119811: Drop localisation and localisation_file_hash tables, l10nwiki databases too - https://phabricator.wikimedia.org/T119811
[07:29:51] !log Drop table localisation_file_hash from testwiki and drop database l10nwiki on s3 - T119811
[07:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:58] 10Operations, 10hardware-requests: codfw: (2) hardware access request for labtest - https://phabricator.wikimedia.org/T154664#3429145 (10faidon)
[07:31:01] 10Operations, 10Cloud-Services, 10hardware-requests: Codfw: (2) hardware access request for labtest [region 2] - https://phabricator.wikimedia.org/T161766#3429144 (10faidon)
[07:31:07] 10Operations, 10Cloud-Services, 10hardware-requests: eqiad: (1) hardware access request for labnodepool1002 - https://phabricator.wikimedia.org/T161753#3429148 (10faidon)
[07:31:32] 10Operations, 10Analytics, 10EventBus, 10hardware-requests, and 2 others: New SCB nodes - https://phabricator.wikimedia.org/T166342#3429151 (10faidon)
[07:33:00] (03CR) 10KartikMistry: [WIP] Make compact language links default for all Wikipedias except en and de (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364428 (owner: 10Amire80)
[07:34:20] !log Rename table migrateuser_medium on db1094 and db1079 - T170310
[07:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:12] (03CR) 10Ema: [C: 031] Depool codfw for asw-b-codfw upgrade. [dns] - 10https://gerrit.wikimedia.org/r/364661 (owner: 10Ayounsi)
[07:38:25] (03PS2) 10Ema: Route traffic around codfw for asw-b-codfw upgrade. [puppet] - 10https://gerrit.wikimedia.org/r/364663 (https://phabricator.wikimedia.org/T169345) (owner: 10Ayounsi)
[07:38:40] (03CR) 10Ayounsi: [C: 032] Depool codfw for asw-b-codfw upgrade. [dns] - 10https://gerrit.wikimedia.org/r/364661 (owner: 10Ayounsi)
[07:38:47] (03CR) 10Ema: [C: 031] Route traffic around codfw for asw-b-codfw upgrade. [puppet] - 10https://gerrit.wikimedia.org/r/364663 (https://phabricator.wikimedia.org/T169345) (owner: 10Ayounsi)
[07:39:57] !log depooled codfw for T169345
[07:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:09] T169345: codfw row B switch upgrade - https://phabricator.wikimedia.org/T169345
[07:40:15] (03CR) 10Ayounsi: [C: 032] Route traffic around codfw for asw-b-codfw upgrade. [puppet] - 10https://gerrit.wikimedia.org/r/364663 (https://phabricator.wikimedia.org/T169345)
[07:40:21] (03PS3) 10Ayounsi: Route traffic around codfw for asw-b-codfw upgrade. [puppet] - 10https://gerrit.wikimedia.org/r/364663 (https://phabricator.wikimedia.org/T169345)
[07:42:54] (03PS1) 10Muehlenhoff: CVE-2017-7529 [software/nginx] - 10https://gerrit.wikimedia.org/r/364664
[07:49:34] (03PS1) 10Marostegui: db2033.yaml: Use the new socket location [puppet] - 10https://gerrit.wikimedia.org/r/364665 (https://phabricator.wikimedia.org/T169510)
[07:50:40] 10Operations, 10ops-esams, 10Patch-For-Review, 10User-fgiunchedi: Decommission esams ms-fe / ms-be - https://phabricator.wikimedia.org/T169518#3429166 (10fgiunchedi) I've reimaged all ms-be / ms-fe in esams and wiped data disks on the former, left to do is to wipe only the OS disks when the time comes for...
[07:53:59] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0]
[07:54:01] (03PS1) 10Ema: Temporarily remove achernar from lvs2* resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/364667 (https://phabricator.wikimedia.org/T169345)
[07:54:03] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/364630 (owner: 10BBlack)
[07:54:46] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=(citoid|restbase-async)
[07:54:54] <_joe_> XioNoX: done! ^
[07:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:07] (03CR) 10Muehlenhoff: [C: 032] CVE-2017-7529 [software/nginx] - 10https://gerrit.wikimedia.org/r/364664 (owner: 10Muehlenhoff)
[07:55:09] PROBLEM - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0]
[07:56:00] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0]
[07:56:15] <_joe_> godog: any idea what is that alarm about? ^^
[07:57:41] _joe_: not yet, I'm looking into recent uploads, I'm suspecting a batch or sth like that
[07:57:51] <_joe_> ok
[07:57:54] it started yesterday
[07:58:06] <_joe_> so we do have more uploads than usual, correct?
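[Editor's note on the [07:54:46] entry: that is conftool pooling the eqiad discovery records (citoid, restbase-async) ahead of the codfw row B work. A hedged sketch of the equivalent confctl invocation; the select/set shape is how conftool is normally driven, but the exact command line here is reconstructed from the logged action, not quoted from the log.]

```
# Pool the citoid and restbase-async discovery records in eqiad.
# (Sketch of the conftool action logged at 07:54:46; command shape
# is an assumption reconstructed from the log message.)
sudo confctl select 'dnsdisc=citoid,name=eqiad' set/pooled=true
sudo confctl select 'dnsdisc=restbase-async,name=eqiad' set/pooled=true
```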
[07:58:32] yeah
[07:58:59] PROBLEM - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [3000.0]
[07:59:00] (03CR) 10Ema: [C: 032] Temporarily remove achernar from lvs2* resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/364667 (https://phabricator.wikimedia.org/T169345) (owner: 10Ema)
[08:02:09] RECOVERY - mediawiki originals uploads -hourly- for eqiad-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0]
[08:03:26] I've asked on -commons, will silence it in the meantime
[08:04:00] RECOVERY - mediawiki originals uploads -hourly- for codfw-prod on graphite1001 is OK: OK: Less than 80.00% above the threshold [2000.0]
[08:07:11] (03PS1) 10Ema: Temporarily remove achernar from lvs4* resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/364669 (https://phabricator.wikimedia.org/T169345)
[08:08:18] 10Operations, 10ORES, 10Scoring-platform-team-Backlog, 10Graphite, 10User-fgiunchedi: Regularly purge old ores graphite metrics - https://phabricator.wikimedia.org/T169969#3429182 (10fgiunchedi) cc @Halfak @Ladsgroup
[08:09:28] (03CR) 10Ema: [C: 032] Temporarily remove achernar from lvs4* resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/364669 (https://phabricator.wikimedia.org/T169345) (owner: 10Ema)
[08:09:57] (03PS3) 10Jcrespo: mariadb: Fix systemd unit for controling multi-instances [software] - 10https://gerrit.wikimedia.org/r/364477 (https://phabricator.wikimedia.org/T169514)
[08:11:13] !log Stop MySQL on db2033 (x1) - T169510
[08:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:25] T169510: Setup dbstore2002 with 2 new mysql instances from production and enable GTID - https://phabricator.wikimedia.org/T169510
[08:11:52] (03PS1) 10Jcrespo: mariadb: Delete override- changed default on package [puppet] - 10https://gerrit.wikimedia.org/r/364671 (https://phabricator.wikimedia.org/T168356)
[08:13:12] (03PS2) 10Marostegui: db2033.yaml: Use the new socket location [puppet] - 10https://gerrit.wikimedia.org/r/364665 (https://phabricator.wikimedia.org/T169510)
[08:14:20] (03CR) 10Marostegui: [C: 032] db2033.yaml: Use the new socket location [puppet] - 10https://gerrit.wikimedia.org/r/364665 (https://phabricator.wikimedia.org/T169510) (owner: 10Marostegui)
[08:14:59] (03PS2) 10Jcrespo: mariadb: Delete override- changed default on package [puppet] - 10https://gerrit.wikimedia.org/r/364671 (https://phabricator.wikimedia.org/T168356)
[08:16:16] (03CR) 10Jcrespo: [C: 032] mariadb: Fix systemd unit for controling multi-instances [software] - 10https://gerrit.wikimedia.org/r/364477 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo)
[08:16:40] 10Operations, 10ops-codfw, 10Performance-Team, 10Thumbor, 10User-fgiunchedi: Rename mw2148 / mw2149 / mw2259 / mw2260 to thumbor200[1234] - https://phabricator.wikimedia.org/T168881#3429193 (10fgiunchedi) Indeed, hosts are gone from puppet(db) now, there are still renaming steps left on dc side
[08:16:43] (03CR) 10Jcrespo: [C: 032] mariadb: Delete override- changed default on package [puppet] - 10https://gerrit.wikimedia.org/r/364671 (https://phabricator.wikimedia.org/T168356) (owner: 10Jcrespo)
[08:18:49] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - api-https_443 - Could not depool server mw1290.eqiad.wmnet because of too many down!: search-https_9243 - Could not depool server elastic1018.eqiad.wmnet because of too many down!: wdqs_80 - Could not depool server wdqs1003.eqiad.wmnet because of too many down!: thumbor_8800 - Could not depool server thumbor1003.eqiad.wmnet because of too many down!:
[08:18:49] Could not depool server ms-fe1008.eqiad.wmnet because of too many down!: appservers-https_443 - Could not depool server mw1258.eqiad.wmnet because of too many down!: rendering-https_443 - Could not depool server mw1298.eqiad.wmnet because of too many down!: kibana_80 - Could not depool server logstash1001.eqiad.wmnet because of too many down!
[08:19:39] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy
[08:20:48] !log restarting asw-b-codfw for upgrade
[08:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:02] <_joe_> ema: ^^ I think this kind of errors are pybal reaching its scalability limits
[08:21:08] <_joe_> we saw the same the other day
[08:21:11] _joe_: yeah
[08:21:27] I don't think we've ever seen it alerting though right?
[08:21:36] <_joe_> no, but this is worrying
[08:21:40] very
[08:21:41] I saw it yesterday
[08:21:55] but it got fixed before i could do something
[08:21:56] _joe_: did you added a new service yesterday?
[08:22:15] <_joe_> volans: no, I should add it today
[08:22:28] ok :D
[08:22:56] so we're still performing one dns lookup for each and every check, which is no good and will be fixed with 1.13.7
[08:23:05] that might play a role
[08:23:39] PROBLEM - Host mx2001 is DOWN: CRITICAL - Network Unreachable (208.80.153.45)
[08:23:39] PROBLEM - Host alsafi is DOWN: CRITICAL - Network Unreachable (208.80.153.50)
[08:23:49] PROBLEM - Host ganeti2003 is DOWN: PING CRITICAL - Packet loss = 100%
[08:23:49] PROBLEM - Host pollux is DOWN: CRITICAL - Network Unreachable (208.80.153.43)
[08:23:49] PROBLEM - Host serpens is DOWN: PING CRITICAL - Packet loss = 100%
[08:23:49] PROBLEM - Host nihal is DOWN: PING CRITICAL - Packet loss = 100%
[08:23:49] PROBLEM - Host oresrdb2001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:23:59] PROBLEM - Host hassaleh is DOWN: PING CRITICAL - Packet loss = 100%
[08:23:59] PROBLEM - Host pybal-test2002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:24:00] PROBLEM - Host dbmonitor2001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:24:00] PROBLEM - Host kubetcd2003 is DOWN: PING CRITICAL - Packet loss = 100%
[08:24:00] PROBLEM - Host sca2003 is DOWN: PING CRITICAL - Packet loss = 100%
[08:24:00] PROBLEM - Host ganeti2002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:24:00] PROBLEM - Host acrab is DOWN: PING CRITICAL - Packet loss = 100%
[08:24:00] PROBLEM - Host install2002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:24:00] PROBLEM - Host kraz is DOWN: PING CRITICAL - Packet loss = 100%
[08:24:01] PROBLEM - Host tureis is DOWN: PING CRITICAL - Packet loss = 100%
[08:24:01] PROBLEM - Host zosma is DOWN: PING CRITICAL - Packet loss = 100%
[08:24:02] PROBLEM - Host ganeti2001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:24:19] PROBLEM - Host ganeti2004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:24:24] PROBLEM - ElasticSearch health check for shards on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch http://10.2.1.30:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.2.1.30, port=9200): Read timed out. (read timeout=4)
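[Editor's note on the [08:22:56] remark above that PyBal still performs one DNS lookup per health check (to be fixed in 1.13.7): one way to observe that behavior on an LVS host is to watch port 53 while the checks run. A sketch under assumptions; the interface name is a placeholder.]

```
# Observe DNS traffic generated by health checks on an LVS host.
# eth0 is a placeholder interface name.
sudo tcpdump -ni eth0 port 53

# A steady stream of A/AAAA queries for backend hostnames, roughly
# one per check, is the per-check lookup behavior described above.
```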
[08:24:29] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[08:24:29] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[08:24:29] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[08:24:39] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 110, down: 2, dormant: 0, excluded: 0, unused: 0BRet-0/0/1: down - Core: asw-b-codfw:et-2/0/51 {#10703} [40Gbps DF]BRae2: down - Core: asw-b-codfw:ae1BR
[08:24:39] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 110, down: 2, dormant: 0, excluded: 0, unused: 0BRae2: down - Core: asw-b-codfw:ae2BRet-0/0/1: down - Core: asw-b-codfw:et-7/0/52 {#10707} [40Gbps DF]BR
[08:24:40] PROBLEM - haproxy failover on dbproxy1005 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[08:24:40] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[08:24:49] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[08:24:49] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[08:24:49] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[08:25:09] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[08:25:09] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - apaches_80 - Could not depool server mw2231.codfw.wmnet because of too many down!: api-https_443 - Could not depool server mw2253.codfw.wmnet because of too many down!: api_80 - Could not depool server mw2221.codfw.wmnet because of too many down!
[08:25:10] PROBLEM - Apache HTTP on mw2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:25:19] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[08:25:21] PROBLEM - configured eth on lvs2001 is CRITICAL: eth1 reporting no carrier.
[08:25:21] PROBLEM - configured eth on lvs2003 is CRITICAL: eth1 reporting no carrier.
[08:25:21] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[08:25:21] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[08:25:21] PROBLEM - configured eth on lvs2002 is CRITICAL: eth1 reporting no carrier.
[08:25:24] dbproxy1005 is for db2030
[08:25:29] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[08:25:29] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[08:25:29] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero alive) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[08:25:39] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[08:25:39] PROBLEM - puppet last run on mw2192 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:25:40] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:26:08] <_joe_> the puppet failures are expected
[08:26:10] RECOVERY - Apache HTTP on mw2017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 635 bytes in 0.115 second response time
[08:26:12] <_joe_> nihal is in row b
[08:26:12] elasticsearch codfw managed to page in the middle of all that
[08:26:19] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:26:33] <_joe_> ema: yeah let's do that asap
[08:26:49] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy
[08:27:49] PROBLEM - Host labtest-ns0.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[08:28:39] PROBLEM - puppet last run on thumbor2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:28:49] PROBLEM - puppet last run on mw2202 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:29:43] PROBLEM - LVS HTTP IPv4 on search.svc.codfw.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 523 bytes in 0.173 second response time
[08:29:43] PROBLEM - IPsec on cp4006 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[08:29:43] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[08:29:43] PROBLEM - IPsec on cp4010 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6
[08:29:44] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[08:29:44] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: cp2012_v4, cp2012_v6
[08:29:44] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[08:29:49] PROBLEM - puppet last run on ms-be2025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:29:49] PROBLEM - IPsec on cp3031 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6
[08:29:49] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[08:29:49] PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6
[08:29:49] PROBLEM - IPsec on cp4017 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6
[08:29:50] PROBLEM - IPsec on cp4009 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6
[08:29:50] PROBLEM - IPsec on cp4018 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6
[08:29:51] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[08:29:59] PROBLEM - IPsec on cp4008 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6
[08:29:59] PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6
[08:29:59] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[08:29:59] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[08:29:59] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: cp2012_v4, cp2012_v6
[08:29:59] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[08:30:00] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[08:30:09] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[08:30:09] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[08:30:10] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[08:30:19] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[08:30:19] PROBLEM - puppet last run on cp2022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:30:19] PROBLEM - IPsec on mc1024 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2024_v4
[08:30:19] PROBLEM - IPsec on mc1026 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2026_v4
[08:30:20] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: cp2012_v4, cp2012_v6
[08:30:20] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[08:30:20] PROBLEM - IPsec on cp4016 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6
[08:30:29] PROBLEM - IPsec on mc1025 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2025_v4
[08:30:29] PROBLEM - puppet last run on mc2019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:30:29] PROBLEM - IPsec on cp3040 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6
[08:30:29] PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6
[08:30:29] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[08:30:29] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[08:30:30] PROBLEM - Host cp2007 is DOWN: PING CRITICAL - Packet loss = 100%
[08:30:30] PROBLEM - Host cp2012 is DOWN: PING CRITICAL - Packet loss = 100%
[08:30:31] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6
[08:30:39] PROBLEM - puppet last run on mc2020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:30:39] PROBLEM - IPsec on mc1023 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2023_v4
[08:30:39] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6
[08:30:40] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[08:30:40] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp2008_v4, cp2008_v6, cp2011_v4, cp2011_v6
[08:30:40] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 40 connecting: cp2007_v4, cp2007_v6, cp2010_v4, cp2010_v6
[08:30:50] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy
[08:30:50] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 54 ESP OK
[08:30:50] RECOVERY - IPsec on cp4005 is OK: Strongswan OK - 54 ESP OK
[08:30:51] RECOVERY - IPsec on cp4010 is OK: Strongswan OK - 44 ESP OK
[08:30:51] PROBLEM - puppet last run on cp2025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:30:51] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 54 ESP OK
[08:30:51] RECOVERY - IPsec on cp3008 is OK: Strongswan OK - 28 ESP OK
[08:30:52] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 54 ESP OK
[08:30:59] PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [100.0]
[08:30:59] RECOVERY - Host cp2012 is UP: PING OK - Packet loss = 0%, RTA = 36.03 ms
[08:30:59] RECOVERY - Host alsafi is UP: PING OK - Packet loss = 0%, RTA = 36.51 ms
[08:30:59] RECOVERY - Host serpens is UP: PING OK - Packet loss = 0%, RTA = 36.24 ms
[08:30:59] RECOVERY - Host install2002 is UP: PING OK - Packet loss = 0%, RTA = 36.58 ms
[08:30:59] RECOVERY - Host zosma is UP: PING OK - Packet loss = 0%, RTA = 36.32 ms
[08:30:59] RECOVERY - Host sca2003 is UP: PING OK - Packet loss = 0%, RTA = 36.27 ms
[08:32:14] PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:32:14] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:32:14] PROBLEM - puppet last run on restbase2012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:32:15] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:32:19] PROBLEM - puppet last run on ms-be2030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:32:19] PROBLEM - puppet last run on es2019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:32:50] PROBLEM - puppet last run on wtp2012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:32:57] !log asw-b-codfw back up - T169345
[08:33:00] PROBLEM - puppet last run on cp2019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:08] T169345: codfw row B switch upgrade - https://phabricator.wikimedia.org/T169345
[08:33:09] RECOVERY - Host labtest-ns0.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 36.07 ms
[08:33:09] PROBLEM - puppet last run on ms-be2018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:12] PROBLEM - puppet last run on labtestvirt2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:12] PROBLEM - puppet last run on mw2141 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:12] PROBLEM - puppet last run on labtestneutron2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:12] PROBLEM - puppet last run on labstore2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:12] PROBLEM - puppet last run on elastic2010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:12] PROBLEM - puppet last run on wtp2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:12] PROBLEM - puppet last run on mw2133 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:12] PROBLEM - Restbase root url on restbase2002 is CRITICAL: connect to address 10.192.16.153 and port 7231: Connection refused
[08:33:19] PROBLEM - puppet last run on mw2235 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:19] PROBLEM - puppet last run on elastic2032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:19] PROBLEM - puppet last run on wdqs2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:19] PROBLEM - puppet last run on kubernetes2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:20] PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:20] PROBLEM - puppet last run on ganeti2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:20] PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:20] PROBLEM - puppet last run on mw2191 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:21] PROBLEM - puppet last run on ganeti2008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:21] PROBLEM - puppet last run on zosma is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:22] PROBLEM - puppet last run on rdb2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:22] PROBLEM - puppet last run on mw2132 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:29] PROBLEM - puppet last run on lvs2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:29] PROBLEM - puppet last run on mw2111 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:29] PROBLEM - puppet last run on mw2255 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:29] PROBLEM - puppet last run on cp2015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:29] PROBLEM - puppet last run on mc2025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:30] PROBLEM - puppet last run on wtp2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:30] PROBLEM - puppet last run on oresrdb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:31] PROBLEM - puppet last run on labtestnet2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:31] PROBLEM - puppet last run on mw2206 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:32] PROBLEM - puppet last run on labtestnet2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:49] PROBLEM - puppet last run on wtp2013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:49] PROBLEM - puppet last run on mw2140 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:49] PROBLEM - puppet last run on mw2134 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:49] PROBLEM - Restbase root url on restbase2010 is CRITICAL: connect to address 10.192.16.185 and port 7231: Connection refused
[08:33:50] PROBLEM - puppet last run on mw2212 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:59] PROBLEM - puppet last run on labstore2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:59] PROBLEM - puppet last run on mw2254 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:59] PROBLEM - puppet last run on wtp2014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:59] PROBLEM - puppet last run on mw2161 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:59] PROBLEM - puppet last run on mw2100 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:59] PROBLEM - puppet last run on mw2115 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:33:59] PROBLEM - puppet last run on mw2106 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:34:00] PROBLEM - puppet last run on mw2099 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:34:00] PROBLEM - puppet last run on mw2108 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:34:01] PROBLEM - puppet last run on mw2117 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:34:01] PROBLEM - puppet last run on mw2116 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:34:02] PROBLEM - puppet last run on ores2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:34:09] PROBLEM - puppet last run on mw2201 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:34:09] PROBLEM - puppet last run on wtp2007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:34:09] PROBLEM - puppet last run on kubetcd2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:34:09] PROBLEM - puppet last run on mw2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:34:09] PROBLEM - puppet last run on ganeti2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:34:10] PROBLEM - puppet last run on kafka2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:34:19] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy
[08:34:19] PROBLEM - puppet last run on ores2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:34:19] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received
[08:34:19] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:34:19] RECOVERY - carbon-frontend-relay metric drops on graphite1001 is OK: OK: Less than 80.00% above the threshold [25.0]
[08:34:19] PROBLEM - puppet last run on cp2016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:34:20] PROBLEM - puppet last run on ms-be2013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:34:29] PROBLEM - puppet last run on db2091 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:34:29] PROBLEM - puppet last run on elastic2031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:35:10] (03PS1) 10Alexandros Kosiaris: Monitoring: confine IPMI checks better to unbreak installs [puppet] - 10https://gerrit.wikimedia.org/r/364673
[08:37:39] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy
[08:37:39] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy
[08:37:49] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy
[08:37:49] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy
[08:37:59] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy
[08:37:59] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy
[08:38:00] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy
[08:38:19] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy
[08:38:29] RECOVERY - Restbase root url on restbase2002 is OK: HTTP OK: HTTP/1.1 200 - 15580 bytes in 0.111 second response time
[08:38:29] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[08:38:49] RECOVERY - Restbase root url on restbase2007 is OK: HTTP OK: HTTP/1.1 200 - 15580 bytes in 0.119 second response time
[08:38:59] RECOVERY - Restbase root url on restbase2010 is OK: HTTP OK: HTTP/1.1 200 - 15580 bytes in 0.111 second response time
[08:40:59] PROBLEM - puppet last run on mw2127 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:51:49] RECOVERY - puppet last run on rdb2003 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[08:52:20] RECOVERY - puppet last run on mw2106 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[08:52:39] RECOVERY - puppet last run on ms-be2018 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[08:52:39] RECOVERY - puppet last run on elastic2010 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[08:52:39] RECOVERY - puppet last run on thumbor2002 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[08:52:59] RECOVERY - puppet last run on labtestnet2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:52:59] RECOVERY - puppet last run on cp2009 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[08:52:59] RECOVERY - puppet last run on mw2192 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[08:53:00] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:53:09] RECOVERY - puppet last run on mw2127 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[08:53:10] RECOVERY - puppet last run on wtp2013 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[08:53:20] RECOVERY - puppet last run on wtp2012 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[08:53:20] RECOVERY - puppet last run on wtp2014 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[08:53:20] RECOVERY - puppet last run on mw2161 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[08:53:29] RECOVERY - puppet last run on ores2006 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[08:53:30] RECOVERY - puppet last run on ganeti2003 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[08:53:39] RECOVERY - puppet last run on ores2005 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[08:53:39] RECOVERY - puppet last run on labtestvirt2001 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[08:53:39] RECOVERY - puppet last run on mw2141 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[08:53:39] RECOVERY - puppet last run on labtestneutron2001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[08:53:39] RECOVERY - puppet last run on wtp2006 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[08:53:39] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[08:53:40] RECOVERY - puppet last run on elastic2032 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[08:53:40] RECOVERY - puppet last run on mw2235 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[08:53:41] RECOVERY - puppet last run on kubernetes2001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[08:53:41] RECOVERY - puppet last run on es2019 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[08:53:42] RECOVERY - puppet last run on ms-be2025 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[08:53:49] RECOVERY - puppet last run on ganeti2008 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[08:53:49] RECOVERY - puppet last run on mw2111 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[08:53:50] RECOVERY - puppet last run on wtp2004 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[08:53:59] RECOVERY - puppet last run on labtestnet2002 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[08:53:59] RECOVERY - puppet last run on mw2188 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[08:54:09] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[08:54:19] RECOVERY - puppet last run on mc2019 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[08:54:19] RECOVERY - puppet last run on mw2212 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[08:54:20] RECOVERY - puppet last run on db2074 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[08:54:29] RECOVERY - puppet last run on mc2020 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[08:54:30] RECOVERY - puppet last run on mw2017 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[08:54:30] RECOVERY - puppet last run on kafka2003 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[08:54:39] RECOVERY - puppet last run on restbase2012 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[08:54:40] RECOVERY - puppet last run on wdqs2001 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[08:54:49] RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[08:54:49] RECOVERY - puppet last run on db2091 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[08:54:50] RECOVERY - puppet last run on mw2202 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[08:54:50] RECOVERY - puppet last run on elastic2031 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[08:54:59] RECOVERY - puppet last run on mw2255 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:54:59] RECOVERY - puppet last run on cp2015 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[08:55:09] RECOVERY - puppet last run on labtestneutron2002 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[08:55:09] RECOVERY - puppet last run on cp2022 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[08:55:09] RECOVERY - puppet last run on wtp2002 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[08:55:29] RECOVERY - puppet last run on mw2099 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[08:55:30] RECOVERY - puppet last run on mw2108 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[08:55:39] RECOVERY - puppet last run on wtp2007 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[08:55:39] RECOVERY - puppet last run on cp2024 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[08:55:40] RECOVERY - puppet last run on cp2016 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[08:55:49] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[08:55:49] RECOVERY - puppet last run on ms-be2030 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[08:55:49] RECOVERY - puppet last run on cp2006 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[08:55:49] RECOVERY - puppet last run on mw2191 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[08:55:50] RECOVERY - puppet last run on zosma is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[08:55:59] RECOVERY - puppet last run on lvs2006 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[08:55:59] RECOVERY - puppet last run on mc2025 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[08:55:59] RECOVERY - puppet last run on oresrdb2001 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[08:55:59] RECOVERY - puppet last run on mw2206 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[08:56:09] RECOVERY - puppet last run on mw2109 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[08:56:10] RECOVERY - puppet last run on mw2130 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[08:56:19] RECOVERY - puppet last run on mw2140 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[08:56:29] RECOVERY - puppet last run on mw2100 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[08:56:29] RECOVERY - puppet last run on mw2116 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[08:56:39] RECOVERY - puppet last run on cp2025 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[08:56:39] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[08:56:49] RECOVERY - puppet last run on cp2002 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[08:56:49] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[08:57:09] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[08:57:19] RECOVERY - puppet last run on mw2134 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[08:57:20] RECOVERY - puppet last run on labstore2001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[08:57:29] RECOVERY - puppet last run on mw2115 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[08:57:35] 10Operations, 10ORES, 10Scoring-platform-team-Backlog, 10Graphite, 10User-fgiunchedi: Regularly purge old ores graphite metrics - https://phabricator.wikimedia.org/T169969#3428269 (10Ladsgroup) @fgiunchedi : Hey, How we can purge scores? I couldn't find anything in wikitech.
[08:57:39] RECOVERY - puppet last run on cp2019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:57:49] RECOVERY - puppet last run on labstore2003 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[08:57:49] RECOVERY - puppet last run on mw2133 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[08:57:49] RECOVERY - puppet last run on ms-be2013 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[08:57:59] RECOVERY - puppet last run on mw2132 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[08:58:09] RECOVERY - puppet last run on lvs4004 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[08:58:17] !log uploading nginx 1.11.10-1+wmf3 for jessie-wikimedia/stretch-wikimedia
[08:58:17] (03PS1) 10Ema: Revert "Temporarily remove achernar from lvs2* resolv.conf" [puppet] - 10https://gerrit.wikimedia.org/r/364674
[08:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:29] RECOVERY - puppet last run on mw2254 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[08:58:29] RECOVERY - puppet last run on mw2117 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[08:58:30] (03PS1) 10Ema: Revert "Temporarily remove achernar from lvs4* resolv.conf" [puppet] - 10https://gerrit.wikimedia.org/r/364675
[08:58:39] RECOVERY - puppet last run on kubetcd2003 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[08:58:49] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[08:58:49] RECOVERY - puppet last run on ganeti2004 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[08:59:09] RECOVERY - puppet last run on ganeti2002 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[09:00:39] RECOVERY - puppet last run on mw2201 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[09:01:09] (03CR) 10Ema: [C: 032] Revert "Temporarily remove achernar from lvs4* resolv.conf" [puppet] - 10https://gerrit.wikimedia.org/r/364675 (owner: 10Ema)
[09:01:21] (03PS2) 10Ema: Revert "Temporarily remove achernar from lvs4* resolv.conf" [puppet] - 10https://gerrit.wikimedia.org/r/364675
[09:01:23] (03CR) 10Ema: [V: 032 C: 032] Revert "Temporarily remove achernar from lvs4* resolv.conf" [puppet] - 10https://gerrit.wikimedia.org/r/364675 (owner: 10Ema)
[09:01:38] (03PS2) 10Ema: Revert "Temporarily remove achernar from lvs2* resolv.conf" [puppet] - 10https://gerrit.wikimedia.org/r/364674
[09:01:45] (03CR) 10Ema: [V: 032 C: 032] Revert "Temporarily remove achernar from lvs2* resolv.conf" [puppet] - 10https://gerrit.wikimedia.org/r/364674 (owner: 10Ema)
[09:03:39] (03PS1) 10Ayounsi: Revert "Route traffic around codfw for asw-b-codfw upgrade." [puppet] - 10https://gerrit.wikimedia.org/r/364677
[09:04:26] (03PS1) 10Ayounsi: Revert "Depool codfw for asw-b-codfw upgrade." [dns] - 10https://gerrit.wikimedia.org/r/364678
[09:06:16] (03CR) 10Ayounsi: [C: 032] Revert "Depool codfw for asw-b-codfw upgrade." [dns] - 10https://gerrit.wikimedia.org/r/364678 (owner: 10Ayounsi)
[09:07:21] (03PS2) 10Ayounsi: Revert "Route traffic around codfw for asw-b-codfw upgrade." [puppet] - 10https://gerrit.wikimedia.org/r/364677
[09:08:39] (03CR) 10Ayounsi: [C: 032] Revert "Route traffic around codfw for asw-b-codfw upgrade." [puppet] - 10https://gerrit.wikimedia.org/r/364677 (owner: 10Ayounsi)
[09:08:49] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds
[09:09:29] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3570 bytes in 0.014 second response time
[09:10:23] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: codfw row B switch upgrade - https://phabricator.wikimedia.org/T169345#3429337 (10ayounsi) 05Open>03Resolved Switch went down for about 10min and came back up properly. Some notes: - The upgrade was more smooth than using NSSU - If a ganeti* h...
[09:11:22] 10Operations, 10netops, 10Patch-For-Review: deploy diffscan2 - https://phabricator.wikimedia.org/T169624#3429339 (10ayounsi) 05Open>03Resolved a:03ayounsi Diffscan has been running smoothly.
[09:13:02] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: codfw row B switch upgrade - https://phabricator.wikimedia.org/T169345#3429342 (10ema) >>! In T169345#3429337, @ayounsi wrote: > - The only page was "search.svc.codfw.wmnet/LVS HTTP IPv4 is CRITICAL" Preceded by "search.svc.codfw.wmnet/ElasticSearch...
[09:13:19] PROBLEM - HHVM rendering on mw2125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:13:27] (03PS1) 10Jcrespo: mariadb: Transform dbstore2002 into multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364681 (https://phabricator.wikimedia.org/T169510)
[09:14:19] RECOVERY - HHVM rendering on mw2125 is OK: HTTP OK: HTTP/1.1 200 OK - 75654 bytes in 0.294 second response time
[09:15:19] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: codfw row B switch upgrade - https://phabricator.wikimedia.org/T169345#3395298 (10dcausse) The number of shards never reached the critical threshold, in irc I've seen: `10:24 PROBLEM - ElasticSearch health check for shards on search.svc.cod...
[09:17:28] (03PS1) 10Muehlenhoff: Remove sshd options specific to SSH protocol 1 [puppet] - 10https://gerrit.wikimedia.org/r/364682 (https://phabricator.wikimedia.org/T170298)
[09:17:52] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364683
[09:18:05] (03CR) 10Marostegui: [C: 04-2] "Wait until the last alters are finished" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364683 (owner: 10Marostegui)
[09:18:59] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds
[09:19:39] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3570 bytes in 0.005 second response time
[09:19:45] !log oblivian@puppetmaster1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=(citoid|restbase-async)
[09:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:58] (03PS3) 10WMDE-leszek: Update my ssh key [puppet] - 10https://gerrit.wikimedia.org/r/363180 (owner: 10Aude)
[09:20:21] (03PS2) 10Jcrespo: mariadb: Transform dbstore2002 into multi-instance, drop db1096 [puppet] - 10https://gerrit.wikimedia.org/r/364681 (https://phabricator.wikimedia.org/T169510)
[09:22:15] (03PS3) 10Elukey: role::mariadb::analytics::custom_repl_slave: add EventLogging cleaner user [puppet] - 10https://gerrit.wikimedia.org/r/364412 (https://phabricator.wikimedia.org/T170118)
[09:22:40] marostegui, jynus going to merge ---^ if you are ok
[09:22:55] (and possibly the eventlogging cleaner script afterwards)
[09:23:05] elukey: fine by me
[09:23:32] elukey: if possible, avoid running the script on dbstore1002 till tomorrow though
[09:23:42] elukey: https://phabricator.wikimedia.org/T166204#3429346
[09:23:51] marostegui: poor dbstore1002, I'll hammer db1047 :D
[09:23:56] yep I saw the task thansk!
[09:24:01] *thanks
[09:24:19] (03CR) 10Muehlenhoff: [C: 031] admin: Remove deployers from restricted group [puppet] - 10https://gerrit.wikimedia.org/r/364469 (https://phabricator.wikimedia.org/T104671) (owner: 10Dereckson)
[09:30:33] PROBLEM - MariaDB Slave Lag: s1 on db1066 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 622.53 seconds
[09:30:46] page
[09:31:04] yep
[09:31:07] that host is depooled
[09:31:10] lost downtime I think
[09:31:25] ok :(
[09:31:28] Because it started the alters yesterday and I think I downtimed it for 2 days
[09:32:05] <_joe_> jesus
[09:33:10] I would welcome a fresh eye into looking into this "icinga loses downtimes stuff"
[09:34:38] akosiaris: I will take a look too
[09:35:02] (03CR) 10Giuseppe Lavagetto: [C: 031] Query and grammar: add support for aliases [software/cumin] - 10https://gerrit.wikimedia.org/r/363748 (https://phabricator.wikimedia.org/T169640) (owner: 10Volans)
[09:35:23] I set it as unbreak now in the hope that someone could be assigned to it with high priority (because we both are busy with goals and stuff)
[09:37:25] it could have been worse, it stopped using downtime and using disable notifications for most down times
[09:38:57] I can see in the logs I did downtime it till tomorrow
[09:39:07] so, yes, it is a lost downtime
[09:40:29] 10Operations, 10ORES, 10Scoring-platform-team-Backlog, 10Graphite, 10User-fgiunchedi: Regularly purge old ores graphite metrics - https://phabricator.wikimedia.org/T169969#3429416 (10fgiunchedi) @Ladsgroup this would be all old graphite metrics for ores not just scores, anyways what we do is setup a cron...
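The cron fgiunchedi mentions is cut off in the excerpt above. A minimal sketch of the usual whisper-pruning pattern for this kind of cleanup, where the path and the 60-day retention are illustrative assumptions rather than values taken from the log:

```
# Delete whisper files not updated in 60 days, then remove the empty
# directories left behind. The metric tree path is an assumption.
find /var/lib/carbon/whisper/ores -type f -name '*.wsp' -mtime +60 -delete
find /var/lib/carbon/whisper/ores -type d -empty -delete
```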
[09:40:59] 10Operations, 10Traffic, 10netops: codfw row C switch upgrade - https://phabricator.wikimedia.org/T170380#3429417 (10ayounsi)
[09:41:42] (03CR) 10Elukey: [C: 032] role::mariadb::analytics::custom_repl_slave: add EventLogging cleaner user [puppet] - 10https://gerrit.wikimedia.org/r/364412 (https://phabricator.wikimedia.org/T170118) (owner: 10Elukey)
[09:43:16] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3429455 (10Marostegui) Another one experienced today: 12 Jul db1066 These are the downtimes logs: ``` [1499749949] EXTERNAL COMMAND: SCHEDULE...
[09:44:54] (03CR) 10Giuseppe Lavagetto: [C: 031] QueryBuilder: fix subgroup close at the end of query [software/cumin] - 10https://gerrit.wikimedia.org/r/363749 (owner: 10Volans)
[09:49:04] (03CR) 10Alexandros Kosiaris: [C: 032] Monitoring: confine IPMI checks better to unbreak installs [puppet] - 10https://gerrit.wikimedia.org/r/364673 (owner: 10Alexandros Kosiaris)
[09:49:09] (03PS2) 10Alexandros Kosiaris: Monitoring: confine IPMI checks better to unbreak installs [puppet] - 10https://gerrit.wikimedia.org/r/364673
[09:49:11] (03CR) 10Volans: [C: 031] "LGTM as a quickfix, to be improved later" [puppet] - 10https://gerrit.wikimedia.org/r/364673 (owner: 10Alexandros Kosiaris)
[09:49:13] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Monitoring: confine IPMI checks better to unbreak installs [puppet] - 10https://gerrit.wikimedia.org/r/364673 (owner: 10Alexandros Kosiaris)
[09:49:24] (03CR) 10Marostegui: [C: 031] "Looks good, I will give you a quick overview, as this server is manually configured, you might not be aware of its setup now." [puppet] - 10https://gerrit.wikimedia.org/r/364681 (https://phabricator.wikimedia.org/T169510) (owner: 10Jcrespo)
[09:51:49] (03CR) 10Giuseppe Lavagetto: [C: 031] "This will break switchdc as well, AIUI. Do you have a corresponding patch already?" [software/cumin] - 10https://gerrit.wikimedia.org/r/363750 (owner: 10Volans)
[09:52:44] RECOVERY - MariaDB Slave Lag: s1 on db1066 is OK: OK slave_sql_lag Replication lag: 0.37 seconds
[09:58:00] (03CR) 10Giuseppe Lavagetto: [C: 031] puppet-run: manual splay at the top (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/364630 (owner: 10BBlack)
[09:58:09] (03PS1) 10Filippo Giunchedi: graphite: keep 'servers' hierarchy for 60d [puppet] - 10https://gerrit.wikimedia.org/r/364687 (https://phabricator.wikimedia.org/T169972)
[09:58:22] 10Operations, 10Graphite, 10Patch-For-Review, 10User-fgiunchedi: Delete "servers" metrics in graphite older than 60d - https://phabricator.wikimedia.org/T169972#3429501 (10fgiunchedi) @Joe yes we are but stopping using diamond collector for server data isn't in scope for this task
[09:59:03] (03CR) 10Jcrespo: [C: 032] "Thanks, I will stop all services, but hopefully not destroy anything." [puppet] - 10https://gerrit.wikimedia.org/r/364681 (https://phabricator.wikimedia.org/T169510) (owner: 10Jcrespo)
[09:59:09] (03PS3) 10Jcrespo: mariadb: Transform dbstore2002 into multi-instance, drop db1096 [puppet] - 10https://gerrit.wikimedia.org/r/364681 (https://phabricator.wikimedia.org/T169510)
[10:00:04] Dereckson: Dear anthropoid, the time has come. Please deploy Create new wikis (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170712T1000).
[10:00:05] 10Operations, 10Traffic, 10netops: codfw row C switch upgrade - https://phabricator.wikimedia.org/T170380#3429506 (10Marostegui) From the db side: * db1031 needs to be downtimed as it is db2033's x1 master and will page with replication broken once db2033 becomes unreachable * We could downtime all the aff...
[10:01:06] (03CR) 10Jcrespo: "Putting the table back into place." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/364630 (owner: 10BBlack)
[10:03:28] 10Operations, 10Traffic, 10netops: Recurring varnish-be fetch failures in codfw - https://phabricator.wikimedia.org/T170131#3429513 (10ema) The fetch errors last precisely 240 seconds on each machine. A quick look at our varnish-be settings seems to open two routes for investigation: - `thread_pool_timeout...
[10:04:53] !log installing nginx security updates on mw* canaries
[10:04:55] (03CR) 10Alexandros Kosiaris: [C: 031] "oh my my ...." [puppet] - 10https://gerrit.wikimedia.org/r/364630 (owner: 10BBlack)
[10:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:18] (03PS27) 10Elukey: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850)
[10:07:09] (03CR) 10jerkins-bot: [V: 04-1] role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey)
[10:08:38] (03PS28) 10Elukey: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850)
[10:08:53] argh didn't see jenkins before re-submitting, another -1 is coming
[10:09:31] yeah jenkins you are right, bad rebase
[10:09:36] (03CR) 10jerkins-bot: [V: 04-1] role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey)
[10:10:27] Hello
[10:11:21] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 5502.78 seconds
[10:13:12] I will disable that check there
[10:15:01] I've to create three db and their tables on s3, no impact?
[10:16:21] Dereckson: did you use: if not exists?
[10:27:30] marostegui: normally yes, but we can find an extension or another one still missing such statements
[10:30:08] Dereckson: ok, if not, it might break replication on dbstore servers, not a big deal, but we'll see
[10:31:35] !log stopping all mysql instances on dbstore2002 and doing an in-place upgrade
[10:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:10] (03PS29) 10Elukey: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850)
[10:38:43] can someone please run https://phabricator.wikimedia.org/T170176 ? Thanks
[10:40:09] <_joe_> TabbyCat: it seems it was run
[10:40:12] <_joe_> from the comments
[10:40:23] <_joe_> oh in dry-run
[10:40:31] _joe_: yep, but could it be non-dry-run?
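The dry-run question here is settled just below in the log with the --fix switch. A minimal sketch of the two invocations, assuming the standard mwscript wrapper; the exact command line used is not shown in the log:

```
# Dry run (the default): only reports pages whose titles collide with
# registered namespace names.
mwscript namespaceDupes.php --wiki=eswiki

# Apply the fixes, using the --fix switch mentioned below.
mwscript namespaceDupes.php --wiki=eswiki --fix
```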
[10:40:32] (03CR) 10Elukey: [C: 032] role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey)
[10:40:36] that ;)
[10:41:31] <_joe_> I'd wait for Dereckson to respond, tbh
[10:41:39] no prob
[10:42:46] * Dereckson checks that
[10:43:42] <_joe_> Dereckson: thanks :)
[10:43:51] <_joe_> I can run the script if it's ok to do so
[10:44:02] <_joe_> I think it is, from the test bd808 did
[10:44:50] oh Dereckson now that you're here, apparently there are problems with the logos we uploaded to sr.wikiquote - I CC'd you there because I see them well and followed the procedure
[10:45:23] !log upgrading nginx on meitnerium/archiva.wikimedia.org
[10:45:27] _joe_: yes, it's ok
[10:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:57] <_joe_> !log running namespaceDupes.php on eswiki, T170176
[10:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:09] T170176: Run namespaceDupes.php on es.wikipedia - https://phabricator.wikimedia.org/T170176
[10:47:53] <_joe_> uhm the script runs in dry run by default
[10:48:09] --fix
[10:48:13] <_joe_> now looking at which switch I should be changing to actually do it
[10:48:23] <_joe_> TabbyCat: hah thanks, I was looking at the source
[10:48:44] <_joe_> {{done}}
[10:48:51] :D
[10:51:20] _joe_: run it in dry run mode one more time and check if there are still some links remaining
[10:51:33] <_joe_> Dereckson: already done
[10:51:37] <_joe_> :)
[10:51:54] (03PS1) 10Elukey: role::mariadb::analytics::custom_repl_slave: deploy the EL whitelist [puppet] - 10https://gerrit.wikimedia.org/r/364691 (https://phabricator.wikimedia.org/T108850)
[10:52:07] (had that occurred, those would have been the ones someone on es.wikipedia would have had to fix)
[10:54:29] !log installing nginx updates on ms1001/dataset1001
[10:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:48] (03PS4) 10Dereckson: Initial configuration for Dinka Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362168 (https://phabricator.wikimedia.org/T168518) (owner: 10Urbanecm)
[10:57:19] (03CR) 10Dereckson: [C: 032] Initial configuration for Dinka Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362168 (https://phabricator.wikimedia.org/T168518) (owner: 10Urbanecm)
[10:58:18] (03Merged) 10jenkins-bot: Initial configuration for Dinka Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362168 (https://phabricator.wikimedia.org/T168518) (owner: 10Urbanecm)
[10:58:27] (03CR) 10jenkins-bot: Initial configuration for Dinka Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362168 (https://phabricator.wikimedia.org/T168518) (owner: 10Urbanecm)
[11:00:40] (03CR) 10Dereckson: Initial configuration for maiwikimedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361297 (https://phabricator.wikimedia.org/T168782) (owner: 10Urbanecm)
[11:03:30] (03CR) 10Elukey: [C: 032] role::mariadb::analytics::custom_repl_slave: deploy the EL whitelist [puppet] - 10https://gerrit.wikimedia.org/r/364691 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey)
[11:03:56] <_joe_> Dereckson: btw, did you try to merge manually some patch yesterday evening? I had to fixup the git status on tin as it was in a detached head state
[11:04:04] <_joe_> and we had a merge commit on master
[11:04:04] !log dereckson@tin Synchronized dblists: Create din.wikipedia (duration: 00m 49s)
[11:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:42] <_joe_> and SAL says you were the last to sync something. you can see my actions in the SAL
[11:04:46] _joe_: hu, thought I reset the branch to origin/master (was a WIP commit by ostriches)
[11:04:58] <_joe_> hah! gotcha.
[11:05:05] <_joe_> as I thought
[11:05:11] 767e0467 HEAD@{6}: checkout: moving from wtf-master to master
[11:05:36] okay I see what you've done
[11:05:36] <_joe_> you did 'git checkout ' instead of 'git reset --hard HEAD~1'
[11:05:52] <_joe_> if I had to guess
[11:06:17] git reset origin/master without the --hard flag to avoid deleting the new scap plugin if in use
[11:06:29] <_joe_> oh
[11:06:32] <_joe_> I did delete it
[11:06:55] well I notified yesterday Chad + the hash if there is a need to recover the commit
[11:06:56] <_joe_> because there is really no way not to remove it if it's committed to git
[11:07:06] <_joe_> --hard doesn't remove unregistered files
[11:07:11] ok
[11:07:21] <_joe_> the commit is in wtf-master
[11:07:28] <_joe_> so thanks for trying to fix that :)
[11:07:54] but next time I do a hard reset ok
[11:08:19] PROBLEM - puppet last run on db1047 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/eventlogging/whitelist.tsv]
[11:08:21] <_joe_> actually the git reset should work and leave the file uncommitted
[11:08:34] db1047 is me
[11:08:36] fixing sorry
[11:08:44] pebcak
[11:09:01] (03PS1) 10Elukey: role::mariadb::analytics::custom_repl_slave: fix typo in whitelist path [puppet] - 10https://gerrit.wikimedia.org/r/364692 (https://phabricator.wikimedia.org/T170118)
[11:09:57] (03CR) 10Elukey: [C: 032] role::mariadb::analytics::custom_repl_slave: fix typo in whitelist path [puppet] - 10https://gerrit.wikimedia.org/r/364692 (https://phabricator.wikimedia.org/T170118) (owner: 10Elukey)
[11:11:31] RECOVERY - puppet last run on db1047 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[11:11:51] !log installing tomcat security updates
[11:11:53] Dereckson: I've noticed your comment about https://gerrit.wikimedia.org/r/#/c/361297/. Should I remove that line?
[11:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:35] Urbanecm: yes, you can
[11:13:53] (03PS8) 10Urbanecm: Initial configuration for maiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361297 (https://phabricator.wikimedia.org/T168782)
[11:14:03] Dereckson: Ok, removed (see above)
[11:14:17] Urbanecm: k
[11:16:05] Reedy: actually, we seem to need --wiki=aawiki in the wiki add script, if not there is a "no version entry for `din`." message
[11:16:12] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:16:33] even with din in langlist and dinwiki in wikiversions.json pulled on Terbium
[11:16:57] you sync'ed the config before to create the last ones when you used it without the flag?
[11:17:01] PROBLEM - puppet last run on dbstore1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/eventlogging/whitelist.tsv]
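The detached-HEAD exchange above turns on the difference between two superficially similar commands. A minimal sketch, where the commit hash is the one quoted from the reflog and the rest is illustrative:

```
# What happened: checking out a commit hash detaches HEAD from the branch.
git checkout 767e0467            # repo is now in "detached HEAD" state

# What was intended: stay on master and drop the unwanted commit;
# --hard discards committed changes but leaves untracked files alone.
git reset --hard origin/master
```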
[11:17:32] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:17:41] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:17:41] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:17:41] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:17:42] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:17:51] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:18:02] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0]
[11:18:05] !log dereckson@tin rebuilt wikiversions.php and synchronized wikiversions files: +dinwiki
[11:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:31] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:19:21] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[11:19:31] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave
[11:19:31] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[11:19:32] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[11:19:32] RECOVERY - MariaDB Slave IO: m3 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[11:19:32] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[11:19:41] RECOVERY - MariaDB Slave IO: s7 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[11:20:11] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0]
[11:20:25] !log dereckson@tin Synchronized langlist: +din (duration: 00m 46s)
[11:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:11] are the dbstore1001's alarms related to mw exceptions?
[11:21:51] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Initial configuration for din.wikipedia (thanks Urbanecm) (T168518) (duration: 00m 46s)
[11:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:22:01] T168518: Create Dinka Wikipedia - https://phabricator.wikimedia.org/T168518
[11:22:52] Cannot access the database: Unknown database 'dinwiki'
[11:23:01] Dereckson: --^
[11:23:41] it seems passed but all those exceptions were probably related to your work?
[11:26:09] 10Operations, 10Traffic, 10netops: codfw row C switch upgrade - https://phabricator.wikimedia.org/T170380#3429417 (10fgiunchedi) re: graphite machines, we'll take the 10 min hit, ditto mwlog
[11:26:18] marostegui: ^
[11:26:35] elukey: 10:30 < marostegui> Dereckson: ok, if not, it might break replication on dbstore servers, not a big deal, but we'll see
[11:26:49] elukey: (it = the lack of a "IF EXISTS" statement)
[11:27:07] Yes, I've just created it
[11:27:16] ah okok, I saw also fatals and just wanted to double check :)
[11:27:28] Dereckson: thanks for creating din.wikipedia.org !
[11:27:38] current fatals seem related to Lua, and for some days
[11:27:38] 234 LuaSandboxFunction::call(): recursion detected in /srv/mediawiki/php-1.30.0-wmf.7/extensions/Scribunto/engines/LuaSandbox/Engine.php on line 312
[11:27:54] Dereckson: https://github.com/wikimedia/operations-mediawiki-config/blob/a4ba01e620ec5d56314e05db485ee057cddaac46/multiversion/MWScript.php#L69
[11:27:59] It really *shouldn't* be needed
[11:28:23] !log installing spice security updates
[11:28:27] I wonder if parameters are being eaten
[11:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:33] # Maintenance.php will treat $argv[1] as the wiki if it doesn't start '-'
[11:28:46] $argv[1] was so "din" and not "dinwiki"
[11:29:04] perhaps change the order of arguments in the script in this case: db lang family
[11:29:07] instead of lang family db
[11:29:16] Dereckson: Maybe we should just remove that "feature" completely from multiversion
[11:29:19] I'm not sure anyone really uses it
[11:30:59] Or, is it just broken
[11:31:06] marostegui: jynus: green light to still create two new databases or you want first to fix the lack of dinwiki on dbstore?
[11:31:54] do whatever you must
[11:32:39] elukey: where did you find that?
[11:34:13] (03CR) 10Amire80: [WIP] Make compact language links default for all Wikipedias except en and de (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364428 (owner: 10Amire80)
[11:34:18] (03PS1) 10Elukey: role::mariadb::analytics::custom_repl_slave: correct owner of el cleaner [puppet] - 10https://gerrit.wikimedia.org/r/364696 (https://phabricator.wikimedia.org/T170118)
[11:34:47] !log Run add wiki maintenance script for dinwiki database / din.wikipedia.org (T168518)
[11:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:34:58] T168518: Create Dinka Wikipedia - https://phabricator.wikimedia.org/T168518
[11:35:04] (03PS9) 10Dereckson: Initial configuration for maiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361297 (https://phabricator.wikimedia.org/T168782) (owner: 10Urbanecm)
[11:35:15] (03CR) 10Dereckson: [C: 032] Initial configuration for maiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361297 (https://phabricator.wikimedia.org/T168782) (owner: 10Urbanecm)
[11:35:32] jynus: it was on a mw fatal that I checked in logstash
[11:36:08] (03Merged) 10jenkins-bot: Initial configuration for maiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361297 (https://phabricator.wikimedia.org/T168782) (owner: 10Urbanecm)
[11:36:09] (but it was not dbstore1001, so I checked to see what was happening)
[11:36:18] (03CR) 10jenkins-bot: Initial configuration for maiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361297 (https://phabricator.wikimedia.org/T168782) (owner: 10Urbanecm)
[11:36:18] elukey: if unique, could be a visit on din.wikipedia.org by a visitor receiving the mail "a wiki has been created" (but was only reachable through mwdebug1002 for a few minutes)
[11:36:42] (03CR) 10Elukey: [C: 032] role::mariadb::analytics::custom_repl_slave: correct owner of el cleaner [puppet] - 10https://gerrit.wikimedia.org/r/364696 (https://phabricator.wikimedia.org/T170118) (owner: 10Elukey)
[11:36:46] well not at this stage dinwiki existed
[11:37:18] (03PS3) 10Amire80: [WIP] Make compact language links default for all Wikipedias except en and de [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364428
[11:37:33] (03PS4) 10Amire80: [WIP] Make compact language links default for all Wikipedias except en and de [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364428
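The multiversion behaviour discussed above is the crux: without an explicit --wiki option, the wrapper treats the first positional argument as the wiki database name. A hedged illustration, where the addWiki.php argument order is an assumption for the example, not quoted from the log:

```
# Without --wiki, multiversion consumes "din" as the wiki DB and fails
# with: no version entry for `din`.
mwscript addWiki.php din wikipedia dinwiki din.wikipedia.org

# With an explicit --wiki, the positional arguments reach addWiki.php intact.
mwscript addWiki.php --wiki=aawiki din wikipedia dinwiki din.wikipedia.org
```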
[11:37:41] RECOVERY - puppet last run on dbstore1002 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[11:38:11] Wikimedia\Rdbms\DBConnectionError from line 796 of /srv/mediawiki/php-1.30.0-wmf.7/includes/libs/rdbms/database/Database.php: Cannot access the database: Unknown database 'dinwiki'
[11:38:19] here your error elukey
[11:38:27] thrown by Special:UserLogin
[11:38:54] yep I didn't mean to blame anybody Dereckson, I was just checking to make sure that it was expected :)
[11:39:43] I synced something too long before running the create database script
[11:40:03] Let me add a note on the Add a wiki page about that
[11:42:11] elukey: https://wikitech.wikimedia.org/w/index.php?title=Add_a_wiki&type=revision&diff=1764277&oldid=1764269 should avoid that in the future
[11:42:25] (03PS5) 10Amire80: [WIP] Make compact language links default for all Wikipedias except en and de [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364428
[11:47:00] Urbanecm: hey, https://gerrit.wikimedia.org/r/#/c/361297/9/wikiversions.json
[11:47:31] Urbanecm: you can use `arc lint` to detect that, with a .arclint config containing the "merge" linter
[11:48:35] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:48:54] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:50:53] Dereckson: Should I do anything with it?
[11:51:26] (03PS1) 10Dereckson: Fix wikiversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364699
[11:51:27] Dereckson, Urbanecm - do you happen to know how long the import of articles from the Incubator to dinwiki will take? Who performs this usually?
[11:51:41] Urbanecm: yes, use `arc lint` with a .arclint config containing merge (I'll write one for the repo if you wish)
[11:51:53] Dereckson: Will it work on ysul?
[11:51:57] it will work
[11:52:02] !log installing apache security updates on mw*
[11:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:52:16] Ok. Please write one for me.
[11:52:25] arc has a lot of linters to detect problems like merge commit
[11:52:38] or we could also perhaps add a check in the Jenkins tests
[11:53:15] (03CR) 10Dereckson: [C: 032] Fix wikiversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364699 (owner: 10Dereckson)
[11:53:38] aharoni: It is performed by new wiki importers. You can watch the status on https://incubator.wikimedia.org/wiki/Incubator:Site_creation_log . Dereckson is creating the wiki now as an empty one and the import will be made a few days later.
[11:53:53] SPQRobin can help
[11:54:10] (03Merged) 10jenkins-bot: Fix wikiversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364699 (owner: 10Dereckson)
[11:54:17] ah I see currently, it's MF-Warburg who performs them
[11:54:35] In this year, yes.
[11:55:30] Dereckson: It says No paths are lintable. (I'm in mediawiki-config's root directory)
[11:56:14] (03CR) 10jenkins-bot: Fix wikiversions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364699 (owner: 10Dereckson)
[11:57:11] !log Run add wiki maintenance script for maiwikimedia database / mai.wikimedia.org (T168782)
[11:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:57:22] T168782: Create fishbowl wiki for Maithili Wikimedians User Group - https://phabricator.wikimedia.org/T168782
[11:57:29] Dereckson: please comment on the wiki creation tickets so we can sanitize those on labs once you are done with their creation
[11:57:34] Urbanecm: only current modifications are, and there is a need to write some .arclint files, we'll check that afterwards
[11:57:37] marostegui: ack'ed
[11:57:47] Dereckson: On the storage ones which are the ones that have DBA tags - thanks!
[11:58:02] Dereckson: Ack'ed, I'll wait.
[12:00:47] (03PS1) 10Elukey: eventlogging_cleaner.py: fix some runtime issues [puppet] - 10https://gerrit.wikimedia.org/r/364701
[12:01:43] !log dereckson@tin Synchronized dblists: +maiwikimedia (duration: 00m 46s)
[12:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:02:15] !log dereckson@tin rebuilt wikiversions.php and synchronized wikiversions files: +maiwiki
[12:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:40] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Initial configuration for mai.wikimedia (T168782) (duration: 00m 46s)
[12:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:49] T168782: Create fishbowl wiki for Maithili Wikimedians User Group - https://phabricator.wikimedia.org/T168782
[12:04:43] !log dereckson@tin Synchronized multiversion/MWMultiVersion.php: +mai.wikimedia new subdomain (duration: 00m 46s)
[12:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:16] 10Operations, 10RESTBase, 10RESTBase-Cassandra, 10Patch-For-Review, 10Services (watching): column family cassandra metrics size - https://phabricator.wikimedia.org/T113733#3430149 (10fgiunchedi) 05Open>03Resolved Yes we can! e.g. for `restbase1009-a/org/apache/cassandra/metrics/ColumnFamily/local_gro...
[12:05:19] 10Blocked-on-Operations, 10Operations, 10Cassandra, 10RESTBase, and 2 others: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#3430151 (10fgiunchedi)
[12:07:12] (03CR) 10Faidon Liambotis: [C: 032] Remove sshd options specific to SSH protocol 1 [puppet] - 10https://gerrit.wikimedia.org/r/364682 (https://phabricator.wikimedia.org/T170298) (owner: 10Muehlenhoff)
[12:07:49] (03PS1) 10Andrew Bogott: Nova: Remove redundant logrotate script that's causing cronspam [puppet] - 10https://gerrit.wikimedia.org/r/364702
[12:08:45] (03CR) 10jerkins-bot: [V: 04-1] Nova: Remove redundant logrotate script that's causing cronspam [puppet] - 10https://gerrit.wikimedia.org/r/364702 (owner: 10Andrew Bogott)
[12:10:13] (03CR) 10Dereckson: [C: 031] Add maiwikimedia to Apache conf [puppet] - 10https://gerrit.wikimedia.org/r/361296 (https://phabricator.wikimedia.org/T168782) (owner: 10Urbanecm)
[12:10:18] Could I get ops assistance to merge this change please (adding mai.wikimedia to Apache)? ^
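The .arclint file promised above is not shown in the log. A minimal sketch of one that catches leftover merge-conflict markers (the linter type name follows Arcanist's conventions; the exact config used for the repo is an assumption), together with the likely reason for the "No paths are lintable" message:

```
cat > .arclint <<'EOF'
{
    "linters": {
        "merge-conflicts": {
            "type": "merge-conflict"
        }
    }
}
EOF
# Bare `arc lint` only inspects locally modified files, hence
# "No paths are lintable" on a clean checkout; pass a path explicitly:
arc lint wikiversions.json
```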
[12:11:23] (03PS2) 10Andrew Bogott: Nova: Remove redundant logrotate script that's causing cronspam [puppet] - 10https://gerrit.wikimedia.org/r/364702
[12:13:36] (03PS2) 10Dereckson: Upload logos for maiwikimedia, add them to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363176 (https://phabricator.wikimedia.org/T168782) (owner: 10Urbanecm)
[12:13:42] (03CR) 10Andrew Bogott: [C: 032] Nova: Remove redundant logrotate script that's causing cronspam [puppet] - 10https://gerrit.wikimedia.org/r/364702 (owner: 10Andrew Bogott)
[12:13:46] (03CR) 10Dereckson: [C: 032] Upload logos for maiwikimedia, add them to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363176 (https://phabricator.wikimedia.org/T168782) (owner: 10Urbanecm)
[12:14:01] !log upgrade nginx on thumbor and prometheus machines
[12:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:16] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:14:27] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:14:41] (03Merged) 10jenkins-bot: Upload logos for maiwikimedia, add them to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363176 (https://phabricator.wikimedia.org/T168782) (owner: 10Urbanecm)
[12:15:26] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:15:26] PROBLEM - salt-minion processes on ms-be1034 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[12:15:46] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:15:46] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:15:56] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:15:56] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:15:56] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:16:14] 10Operations, 10Wikimedia-Site-requests, 10Patch-For-Review, 10User-Urbanecm: Create fishbowl wiki for Maithili Wikimedians User Group - https://phabricator.wikimedia.org/T168782#3430155 (10Dereckson)
[12:16:16] (03CR) 10jenkins-bot: Upload logos for maiwikimedia, add them to IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363176 (https://phabricator.wikimedia.org/T168782) (owner: 10Urbanecm)
[12:16:17] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:16:17] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:16:37] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:17:16] RECOVERY - MariaDB Slave SQL: s4 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[12:17:16] RECOVERY - MariaDB Slave SQL: s6 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[12:17:16] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[12:17:17] PROBLEM - DPKG on prometheus2004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[12:17:36] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[12:17:36] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[12:17:46] RECOVERY - MariaDB Slave SQL: m2 on dbstore1001 is OK: OK slave_sql_state not a slave
[12:17:46] RECOVERY - MariaDB Slave IO: s2 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes
[12:17:47] RECOVERY - MariaDB Slave SQL: s1 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[12:17:47] RECOVERY - MariaDB Slave SQL: m3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: No, (no error: intentional)
[12:18:22] !log dereckson@tin Synchronized static/images/project-logos: Logos for mai.wikimedia (T168782) (duration: 00m 46s)
[12:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:32] T168782: Create fishbowl wiki for Maithili Wikimedians User Group - https://phabricator.wikimedia.org/T168782
[12:19:09] 10Operations, 10Wikimedia-Site-requests, 10Patch-For-Review, 10User-Urbanecm: Create fishbowl wiki for Maithili Wikimedians User Group - https://phabricator.wikimedia.org/T168782#3430182 (10Dereckson)
[12:20:54] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Logos for mai.wikimedia (T168782) (duration: 00m 46s)
[12:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:16] PROBLEM - puppet last run on prometheus2004 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 2 minutes ago with 4 failures. Failed resources (up to 3 shown): Package[prometheus-node-exporter],Package[nginx-full],Package[nginx-common],Package[puppet]
[12:24:03] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364683
[12:25:45] Dereckson: there is an untracked typescript file on /srv/mediawiki-staging is that yours?
[12:26:36] !log reimage mw1260 (video scaler) to jessie
[12:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:28:36] RECOVERY - DPKG on prometheus2004 is OK: All packages OK
[12:32:10] (03PS3) 10Dereckson: Set initial configuration for techconduct.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354985 (https://phabricator.wikimedia.org/T165977)
[12:32:31] (03CR) 10Dereckson: "PS3: +logos, +wikiversion" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354985 (https://phabricator.wikimedia.org/T165977) (owner: 10Dereckson)
[12:32:35] (03CR) 10Dereckson: [C: 032] Set initial configuration for techconduct.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354985 (https://phabricator.wikimedia.org/T165977) (owner: 10Dereckson)
[12:32:59] marostegui: yes, cleaning it
[12:33:10] thanks!
[12:33:20] done [12:33:31] (03Merged) 10jenkins-bot: Set initial configuration for techconduct.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354985 (https://phabricator.wikimedia.org/T165977) (owner: 10Dereckson) [12:33:36] PROBLEM - DPKG on bast3002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:33:42] (03CR) 10jenkins-bot: Set initial configuration for techconduct.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354985 (https://phabricator.wikimedia.org/T165977) (owner: 10Dereckson) [12:34:07] marostegui: you've a pooling change to deploy? [12:34:22] (03PS1) 10Andrew Bogott: Labs puppetmaster: Add some conftool stuff to labs/puppetmaster/frontend.yaml [puppet] - 10https://gerrit.wikimedia.org/r/364705 [12:34:36] RECOVERY - DPKG on bast3002 is OK: All packages OK [12:34:38] Dereckson: I can wait :) [12:34:58] (03PS1) 10Alexandros Kosiaris: Fix RSpec for monitoring module [puppet] - 10https://gerrit.wikimedia.org/r/364706 [12:35:00] (03PS1) 10Alexandros Kosiaris: monitoring: Simplify check_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/364707 [12:35:02] (03PS1) 10Alexandros Kosiaris: WIP: Monitoring: Simplify BMC hostnames [puppet] - 10https://gerrit.wikimedia.org/r/364708 [12:35:07] PROBLEM - Check systemd state on bast3002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:37:00] (03CR) 10Andrew Bogott: [C: 032] Labs puppetmaster: Add some conftool stuff to labs/puppetmaster/frontend.yaml [puppet] - 10https://gerrit.wikimedia.org/r/364705 (owner: 10Andrew Bogott) [12:37:28] marostegui: by the way, if you've 5 minutes free, could you update Apache Puppet configuration per https://wikitech.wikimedia.org/wiki/Add_a_wiki#Apache_configuration? I've created mai.wikimedia but I then noticed it was only added to DNS, not yet Apache. 
Change to merge is https://gerrit.wikimedia.org/r/#/c/361296/ [12:38:21] (03PS1) 10Dereckson: Revert "Set initial configuration for techconduct.wikimedia.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364709 (https://phabricator.wikimedia.org/T165977) [12:38:29] I've the same problem with techconduct but I'll postpone it [12:39:00] (03CR) 10Dereckson: [C: 032] Revert "Set initial configuration for techconduct.wikimedia.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364709 (https://phabricator.wikimedia.org/T165977) (owner: 10Dereckson) [12:39:19] (03PS2) 10Alexandros Kosiaris: Fix RSpec for monitoring module [puppet] - 10https://gerrit.wikimedia.org/r/364706 [12:39:26] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Fix RSpec for monitoring module [puppet] - 10https://gerrit.wikimedia.org/r/364706 (owner: 10Alexandros Kosiaris) [12:39:46] (03CR) 10Alexandros Kosiaris: [C: 032] monitoring: Simplify check_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/364707 (owner: 10Alexandros Kosiaris) [12:39:52] (03PS2) 10Alexandros Kosiaris: monitoring: Simplify check_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/364707 [12:39:54] (03Merged) 10jenkins-bot: Revert "Set initial configuration for techconduct.wikimedia.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364709 (https://phabricator.wikimedia.org/T165977) (owner: 10Dereckson) [12:39:56] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] monitoring: Simplify check_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/364707 (owner: 10Alexandros Kosiaris) [12:40:15] (03CR) 10jenkins-bot: Revert "Set initial configuration for techconduct.wikimedia.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364709 (https://phabricator.wikimedia.org/T165977) (owner: 10Dereckson) [12:41:34] 10Operations, 10DBA, 10Wikimedia-Site-requests, 10Patch-For-Review: Create CoC committee private wiki - https://phabricator.wikimedia.org/T165977#3430212 (10Dereckson) Wiki creation is postponed. Next step is to merge Apache configuration — https://gerrit.wikimedia.org/r/354959 [12:43:41] (03PS1) 10Dereckson: Update interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364713 [12:43:59] (03CR) 10Dereckson: [C: 032] Update interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364713 (owner: 10Dereckson) [12:44:29] PROBLEM - Confd template for /srv/config-master/pybal/codfw/citoid on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:44:30] PROBLEM - Confd template for /srv/config-master/pybal/codfw/misc_web-https on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:44:30] PROBLEM - Confd template for /srv/config-master/pybal/codfw/swift on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:44:30] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/apertium on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:44:30] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/git-ssh on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:44:30] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/logstash-udp2log on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:44:30] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/upload-https on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:44:31] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/rendering-https on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:44:31] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/text on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:44:50] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:45:29] PROBLEM - Confd template for /srv/config-master/pybal/codfw/cxserver on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:45:29] PROBLEM - Check systemd state on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:45:29] PROBLEM - Confd template for /srv/config-master/pybal/codfw/mobileapps on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:45:29] PROBLEM - Confd template for /srv/config-master/pybal/codfw/swift-https on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:45:29] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/api on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:45:29] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/graphoid on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:45:29] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/restbase on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:45:30] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/mathoid on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:45:30] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/wdqs on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:45:31] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/text-https on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:46:09] PROBLEM - Check the NTP synchronisation status of timesyncd on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:46:11] PROBLEM - Confd template for /srv/config-master/pybal/codfw/text on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:46:11] PROBLEM - Confd template for /srv/config-master/pybal/codfw/dns_rec on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:46:11] PROBLEM - Confd template for /srv/config-master/pybal/codfw/ores on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:46:11] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/kartotherian on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:46:11] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/misc_web on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:46:11] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/api-https on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:46:11] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/search on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:46:11] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/upload on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:46:12] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/zotero on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:00] PROBLEM - Confd template for /srv/config-master/pybal/codfw/dns_rec_udp on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:00] PROBLEM - Confd template for /srv/config-master/pybal/codfw/parsoid on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:00] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/appservers-https on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:00] PROBLEM - Confd template for /srv/config-master/pybal/codfw/text-https on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:00] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/search-https on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:00] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/kibana on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:00] PROBLEM - Confd template for /srv/config-master/pybal/ulsfo/upload-https on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:01] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/misc_web-https on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:01] PROBLEM - Confd template for /srv/config-master/pybal/esams/dns_rec on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:06] (03Merged) 10jenkins-bot: Update interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364713 (owner: 10Dereckson) [12:47:14] (03CR) 10jenkins-bot: Update interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364713 (owner: 10Dereckson) [12:47:59] PROBLEM - Confd template for /srv/config-master/discovery/discovery-basic.yaml on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:59] PROBLEM - Confd template for /srv/config-master/pybal/codfw/pdfrender on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:59] PROBLEM - Confd template for /srv/config-master/pybal/codfw/eventbus on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:59] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/mobileapps on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:59] PROBLEM - Confd template for /srv/config-master/pybal/codfw/thumbor on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:59] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/swift on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:47:59] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/aqs on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:48:00] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/kubemaster on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:48:00] PROBLEM - Confd template for /srv/config-master/pybal/esams/dns_rec_udp on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:48:35] labtestpuppetmaster2001 reimaged or similar? 
Cc andrewbogott madhuvishy [12:48:49] PROBLEM - Confd template for /srv/config-master/discovery/services.yaml on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:48:49] PROBLEM - Confd template for /srv/config-master/pybal/codfw/eventstreams on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:48:49] PROBLEM - Confd template for /srv/config-master/pybal/codfw/prometheus on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:48:49] PROBLEM - Confd template for /srv/config-master/pybal/codfw/trendingedits on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:48:49] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/logstash-gelf on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:48:50] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/citoid on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:48:50] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/ocg on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:48:51] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/swift-https on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:48:51] PROBLEM - confd service on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:48:52] PROBLEM - Confd template for /srv/config-master/pybal/esams/misc_web on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:49:00] RECOVERY - puppet last run on prometheus2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:49:40] PROBLEM - Confd template for /srv/config-master/pybal/codfw/graphoid on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:49:40] PROBLEM - Confd template for /srv/config-master/pybal/codfw/apaches on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:49:49] PROBLEM - Confd template for /srv/config-master/pybal/codfw/rendering on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:49:49] PROBLEM - Confd template for /srv/config-master/pybal/codfw/upload on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:49:49] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/logstash-json-tcp on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:49:49] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/cxserver on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:49:49] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/ores on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:49:49] PROBLEM - Confd template for /srv/config-master/pybal/esams/misc_web-https on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:49:49] PROBLEM - puppetmaster backend https on labtestpuppetmaster2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:49:50] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/text on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:49:54] uptime 0d 15h 30m 38s from icinga, going to downtime it [12:50:06] dammit, what's it take to shut that up? 
[12:50:15] The whole box and all services are downtimed until 2019 [12:50:20] * andrewbogott looks at icinga yet again [12:50:38] just downtimed it for a day [12:50:40] :) [12:51:02] thanks [12:51:11] any idea what happened? I downtimed it before starting any of this [12:51:27] andrewbogott: join the club [12:51:31] there is an outstanding issue with icinga losing downtimes [12:51:36] 'schedule downtime for checked host and all services' [12:51:39] yeah, ok [12:51:41] :( [12:51:54] https://phabricator.wikimedia.org/T164206 [12:54:00] !log Deploy alter table s1 - db1095 - T166204 [12:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:09] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204 [12:54:42] !log dereckson@tin Synchronized wmf-config/interwiki.php: Interwiki map update (duration: 00m 46s) [12:54:46] marostegui: I'm done on Tin [12:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:24] (03PS3) 10Aude: Temporarily set $wgPropertySuggesterClassifyingPropertyIds to [ 31 ]. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361851 (https://phabricator.wikimedia.org/T169058) (owner: 10Daniel Kinzler) [12:56:08] Dereckson: thanks [12:56:20] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3430294 (10bd808) Another data point. Alerts starting at 2017-07-12T12:44 for labtestpuppetmaster2001 which @Andrew had marked as downtimed un... [12:58:04] (03PS3) 10Marostegui: Revert "db-eqiad.php: Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364683 [12:59:29] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:59:45] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364683 (owner: 10Marostegui) [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170712T1300). [13:00:12] o/ [13:00:41] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364683 (owner: 10Marostegui) [13:00:42] aude: deploying your own change, or should I do it?
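A note on the downtime mechanics being debated above: icinga's 'schedule downtime for checked host and all services' UI action boils down to SCHEDULE_HOST_DOWNTIME plus SCHEDULE_HOST_SVC_DOWNTIME external commands written to the daemon's command pipe, and a script can submit the same thing directly. A sketch, where the command-file path and the author string are assumptions; whether icinga then *keeps* the downtime is, per T164206 above, a separate question:

```python
#!/usr/bin/env python
# Sketch: downtime a host and all of its services by writing external
# commands to icinga's command pipe. The path is the Debian default and
# may differ on a given install (an assumption, not checked here).
import time

CMD_FILE = '/var/lib/icinga/rw/icinga.cmd'  # assumed location

def downtime(host, hours, author='ops', comment='maintenance'):
    start = int(time.time())
    end = start + int(hours * 3600)
    # Field order: start;end;fixed(1);trigger_id(0);duration;author;comment
    tail = '%d;%d;1;0;%d;%s;%s' % (start, end, end - start, author, comment)
    with open(CMD_FILE, 'w') as pipe:
        pipe.write('[%d] SCHEDULE_HOST_DOWNTIME;%s;%s\n' % (start, host, tail))
        pipe.write('[%d] SCHEDULE_HOST_SVC_DOWNTIME;%s;%s\n' % (start, host, tail))

if __name__ == '__main__':
    downtime('labtestpuppetmaster2001', hours=24)
```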
[13:00:53] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1066" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364683 (owner: 10Marostegui) [13:01:38] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1066 - T166204 (duration: 00m 46s) [13:01:42] (03CR) 10Elukey: [C: 032] eventlogging_cleaner.py: fix some runtime issues [puppet] - 10https://gerrit.wikimedia.org/r/364701 (owner: 10Elukey) [13:01:47] zeljkof: i temporarily don't have deployment access so if you can, that would be helpful [13:01:47] (03PS2) 10Elukey: eventlogging_cleaner.py: fix some runtime issues [puppet] - 10https://gerrit.wikimedia.org/r/364701 [13:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:49] (03CR) 10Elukey: [V: 032 C: 032] eventlogging_cleaner.py: fix some runtime issues [puppet] - 10https://gerrit.wikimedia.org/r/364701 (owner: 10Elukey) [13:01:51] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204 [13:02:12] aude: sure [13:02:14] thanks [13:02:22] aude: can you test it at mwdebug? [13:02:26] sure [13:02:59] (03PS1) 10Marostegui: db-eqiad.php: Depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364716 (https://phabricator.wikimedia.org/T166204) [13:03:42] aude: ok, will ping you in a few minutes when the commit is at mwdebug [13:04:47] ok [13:05:34] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361851 (https://phabricator.wikimedia.org/T169058) (owner: 10Daniel Kinzler) [13:06:28] (03Merged) 10jenkins-bot: Temporarily set $wgPropertySuggesterClassifyingPropertyIds to [ 31 ]. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361851 (https://phabricator.wikimedia.org/T169058) (owner: 10Daniel Kinzler) [13:06:37] (03CR) 10jenkins-bot: Temporarily set $wgPropertySuggesterClassifyingPropertyIds to [ 31 ]. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/361851 (https://phabricator.wikimedia.org/T169058) (owner: 10Daniel Kinzler) [13:06:57] hi ops, thanks for getting that firewall change through so fast! [13:10:23] aude: 361851 is at mwdebug1002, please test and let me know if I can do a full scap [13:11:45] checking [13:12:03] looks ok [13:12:13] the change should be no-op but nothing looks broken [13:13:21] aude: ok, deploying [13:13:58] thanks [13:14:12] !log zfilipin@tin Synchronized wmf-config/Wikibase-production.php: SWAT: [[gerrit:361851|Temporarily set $wgPropertySuggesterClassifyingPropertyIds to [ 31 ]. (T169058)]] (duration: 00m 46s) [13:14:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:24] T169058: Temporarily set $wgPropertySuggesterClassifyingPropertyIds to [ 31 ] for wikidata.org - https://phabricator.wikimedia.org/T169058 [13:14:29] PROBLEM - DPKG on mw1260 is CRITICAL: Return code of 255 is out of bounds [13:14:30] PROBLEM - puppet last run on mw1260 is CRITICAL: Return code of 255 is out of bounds [13:14:41] aude: deployed, please check if things look ok [13:14:54] ok [13:14:57] :D [13:15:09] PROBLEM - HHVM jobrunner on mw1260 is CRITICAL: connect to address 10.64.16.146 and port 9005: Connection refused [13:15:19] PROBLEM - Disk space on mw1260 is CRITICAL: Return code of 255 is out of bounds [13:15:20] PROBLEM - salt-minion processes on mw1260 is CRITICAL: Return code of 255 is out of bounds [13:16:39] seems ok [13:16:39] ah this one is moritzm reimaging the host for sure [13:17:02] aude: great, thanks for deploying with #releng!
;) [13:17:09] !log EU SWAT finished! [13:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:59] PROBLEM - MD RAID on mw1260 is CRITICAL: Return code of 255 is out of bounds [13:18:05] (03PS1) 10Ottomata: Create reportupdater::jobs profiles for stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/364718 (https://phabricator.wikimedia.org/T152712) [13:18:13] (03PS2) 10Marostegui: db-eqiad.php: Depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364716 (https://phabricator.wikimedia.org/T166204) [13:18:59] thanks [13:19:00] (03CR) 10jerkins-bot: [V: 04-1] Create reportupdater::jobs profiles for stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/364718 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [13:19:16] yep, mw1260 is still being reimaged, I'll silence that again [13:19:27] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364716 (https://phabricator.wikimedia.org/T166204) (owner: 10Marostegui) [13:19:29] PROBLEM - puppet last run on mw1252 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[apache2] [13:19:40] PROBLEM - configured eth on mw1260 is CRITICAL: Return code of 255 is out of bounds [13:19:42] (03PS2) 10Ottomata: Create reportupdater::jobs profiles for stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/364718 (https://phabricator.wikimedia.org/T152712) [13:20:23] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364716 (https://phabricator.wikimedia.org/T166204) (owner: 10Marostegui) [13:20:32] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1072 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364716 (https://phabricator.wikimedia.org/T166204) (owner: 10Marostegui) [13:21:26] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1072 - T166204 (duration: 00m 46s) [13:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:39] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204 [13:22:00] (03PS1) 10Ottomata: Use stretch for stat1006 [puppet] - 10https://gerrit.wikimedia.org/r/364719 (https://phabricator.wikimedia.org/T152712) [13:23:12] (03PS3) 10Ottomata: Create reportupdater::jobs profiles for stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/364718 (https://phabricator.wikimedia.org/T152712) [13:23:20] !log Deploy alter table s1 - db1072 - T166204 [13:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:48] (03CR) 10Ottomata: [C: 032] Use stretch for stat1006 [puppet] - 10https://gerrit.wikimedia.org/r/364719 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [13:24:10] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 22 probes of 435 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [13:24:10] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 162 probes of 291 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [13:24:37] the stat* boxes are jumping from trusty to stretch :D [13:25:24] (03PS2) 10Giuseppe Lavagetto: Add entries for service recommendation-api [dns] - 10https://gerrit.wikimedia.org/r/364457 (https://phabricator.wikimedia.org/T165760) [13:25:26] (03PS2) 10Giuseppe Lavagetto: Add discovery DNS entry for service recommendation-api 
[dns] - 10https://gerrit.wikimedia.org/r/364458 (https://phabricator.wikimedia.org/T165760) [13:25:41] (03CR) 10jerkins-bot: [V: 04-1] Add discovery DNS entry for service recommendation-api [dns] - 10https://gerrit.wikimedia.org/r/364458 (https://phabricator.wikimedia.org/T165760) (owner: 10Giuseppe Lavagetto) [13:29:19] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 435 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [13:31:38] (03PS4) 10Ottomata: Create reportupdater::jobs profiles for stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/364718 (https://phabricator.wikimedia.org/T152712) [13:33:47] (03PS2) 10BBlack: puppet-run: manual splay at the top [puppet] - 10https://gerrit.wikimedia.org/r/364630 [13:34:19] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 292 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [13:34:43] (03CR) 10BBlack: [C: 032] puppet-run: manual splay at the top [puppet] - 10https://gerrit.wikimedia.org/r/364630 (owner: 10BBlack) [13:34:53] (03CR) 10Ottomata: [C: 032] "Mostly no-op, some requirements have been shuffled, but the result is the same: https://puppet-compiler.wmflabs.org/7022/" [puppet] - 10https://gerrit.wikimedia.org/r/364718 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [13:35:02] (03PS1) 10Andrew Bogott: include 'standard' on labtestpuppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/364721 [13:35:04] (03PS5) 10Ottomata: Create reportupdater::jobs profiles for stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/364718 (https://phabricator.wikimedia.org/T152712) [13:35:08] (03CR) 10Ottomata: [V: 032 C: 032] Create reportupdater::jobs profiles for stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/364718 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [13:35:19] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:35:55] (03PS3) 10BBlack: puppet-run: manual splay at the top [puppet] - 10https://gerrit.wikimedia.org/r/364630 [13:35:57] (03CR) 10BBlack: [V: 032 C: 032] puppet-run: manual splay at the top [puppet] - 10https://gerrit.wikimedia.org/r/364630 (owner: 10BBlack) [13:35:59] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:37:39] ottomata: still reviewing? [13:38:37] (03PS2) 10Andrew Bogott: include 'standard' on labtestpuppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/364721 [13:39:28] (03CR) 10Muehlenhoff: include 'standard' on labtestpuppetmaster2001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/364721 (owner: 10Andrew Bogott) [13:39:59] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:40:09] RECOVERY - puppet last run on mw1252 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [13:40:20] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:40:42] bblack: reviewing? [13:40:47] (03CR) 10Andrew Bogott: [C: 032] include 'standard' on labtestpuppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/364721 (owner: 10Andrew Bogott) [13:41:00] oh sorry [13:41:02] got caught [13:41:05] you can merge mine too [13:41:07] there's 3! [13:41:14] I donno about andrewbogott's :) [13:41:18] andrewbogott: ? 
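The 'puppet-run: manual splay at the top' change merged above tackles the thundering-herd problem of cron-driven agents starting in lockstep. The core of any such splay is a bounded sleep before the run; this sketch uses a hostname-derived, deterministic delay, which is one common variant (the actual puppet-run change may simply pick a random value, as its 'Sleeping N for random splay' output later in this log suggests):

```python
#!/usr/bin/env python
# Illustrative manual splay for a cron-driven puppet wrapper: spread the
# fleet's runs across a window instead of stampeding the puppetmasters.
# Window size and hashing scheme are assumptions, not the puppet-run values.
import hashlib
import socket
import time

SPLAY_WINDOW = 60  # seconds

fqdn = socket.getfqdn()
# Hashing the FQDN gives each host a stable slot within the window.
delay = int(hashlib.md5(fqdn.encode()).hexdigest(), 16) % SPLAY_WINDOW
print("Sleeping %d for splay (%s)" % (delay, fqdn))
time.sleep(delay)
# ... the wrapper would exec the actual agent run here
```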
[13:41:26] yes, please merge mine [13:41:29] merging [13:43:21] (03PS3) 10Giuseppe Lavagetto: recommendation api: refactor profile, remove module [puppet] - 10https://gerrit.wikimedia.org/r/364221 (https://phabricator.wikimedia.org/T148129) [13:43:23] (03PS2) 10Giuseppe Lavagetto: role::scb: add recommendation-api service [puppet] - 10https://gerrit.wikimedia.org/r/364451 (https://phabricator.wikimedia.org/T165760) [13:45:35] (03CR) 10jerkins-bot: [V: 04-1] role::scb: add recommendation-api service [puppet] - 10https://gerrit.wikimedia.org/r/364451 (https://phabricator.wikimedia.org/T165760) (owner: 10Giuseppe Lavagetto) [13:46:18] (03CR) 10Giuseppe Lavagetto: "> Hm, we have a kind of a de facto standard to have a module per" [puppet] - 10https://gerrit.wikimedia.org/r/364221 (https://phabricator.wikimedia.org/T148129) (owner: 10Giuseppe Lavagetto) [13:50:37] dinwiki doesn't let me add interwiki links right now, I guess we still have some wikidata stuff to do? [13:51:47] (03PS4) 10Giuseppe Lavagetto: recommendation api: refactor profile, remove module [puppet] - 10https://gerrit.wikimedia.org/r/364221 (https://phabricator.wikimedia.org/T148129) [13:51:49] (03PS3) 10Giuseppe Lavagetto: role::scb: add recommendation-api service [puppet] - 10https://gerrit.wikimedia.org/r/364451 (https://phabricator.wikimedia.org/T165760) [13:53:47] 10Operations, 10Dumps-Generation: Architecture and puppetize setup for dumpsdata boxes - https://phabricator.wikimedia.org/T169849#3430532 (10ArielGlenn) Due to dump scheduling, the code that will take the longest to deploy will be the changesets to the dumps repo; these can only be pushed between dump runs. T... [13:54:03] (03CR) 10Alexandros Kosiaris: [C: 031] ">yes, that is my plan all along. Let's start with this and remove the modules for the others in the next few weeks." [puppet] - 10https://gerrit.wikimedia.org/r/364221 (https://phabricator.wikimedia.org/T148129) (owner: 10Giuseppe Lavagetto) [13:56:48] (03CR) 10Faidon Liambotis: [C: 031] Add runbook link and remove
from Nagios interfaces check messages. [puppet] - 10https://gerrit.wikimedia.org/r/363309 (owner: 10Ayounsi) [13:57:26] (03PS1) 10Ottomata: Refactor geowiki a bit and make a geowiki profile [puppet] - 10https://gerrit.wikimedia.org/r/364727 (https://phabricator.wikimedia.org/T152712) [13:57:33] (03Abandoned) 10Filippo Giunchedi: graphite: keep 'servers' hierarchy for 60d [puppet] - 10https://gerrit.wikimedia.org/r/364687 (https://phabricator.wikimedia.org/T169972) (owner: 10Filippo Giunchedi) [13:58:20] (03CR) 10jerkins-bot: [V: 04-1] Refactor geowiki a bit and make a geowiki profile [puppet] - 10https://gerrit.wikimedia.org/r/364727 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [13:58:26] (03PS1) 10ArielGlenn: write out a list of special dump files per dump run that downloaders may want [dumps] - 10https://gerrit.wikimedia.org/r/364729 (https://phabricator.wikimedia.org/T169849) [13:59:15] (03PS2) 10Ottomata: Refactor geowiki a bit and make a geowiki profile [puppet] - 10https://gerrit.wikimedia.org/r/364727 (https://phabricator.wikimedia.org/T152712) [14:00:13] ottomata: (not sure if finished, but ::profile::geowiki seems missing from --^) [14:00:23] OO! [14:00:25] thanks :o [14:00:56] (03PS3) 10Ottomata: Refactor geowiki a bit and make a geowiki profile [puppet] - 10https://gerrit.wikimedia.org/r/364727 (https://phabricator.wikimedia.org/T152712) [14:01:09] elukey: i know we said we weren't going to do this little bit of profile stuff for stat, but it was getting hard to manage piecemeal transitions [14:01:22] i want to see if i can move the jobs over to the new boxes before trying to get the people part over [14:01:27] ottomata: I saw the code reviews and they make sense! [14:01:31] :) [14:01:49] i'm going to rename the current stat roles to ::old or something, or make new ones for the new boxes [14:01:58] that way we can move stuff selectively from the old roles to the new ones [14:02:05] and then eventually get rid of the old ones [14:02:38] super [14:02:50] PROBLEM - MariaDB Slave Lag: x1 on db2033 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 317.58 seconds [14:03:37] ^ checking [14:04:39] Looks like it was semi sync [14:04:50] RECOVERY - MariaDB Slave Lag: x1 on db2033 is OK: OK slave_sql_lag Replication lag: 6.42 seconds [14:05:55] (03PS4) 10Ottomata: Refactor geowiki a bit and make a geowiki profile [puppet] - 10https://gerrit.wikimedia.org/r/364727 (https://phabricator.wikimedia.org/T152712) [14:07:00] !log Run redact_sanitarium on db1069 for dinwiki - T169193 [14:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:14] T169193: Prepare and check storage layer for dinwiki - https://phabricator.wikimedia.org/T169193 [14:08:34] (03CR) 10Ottomata: [C: 032] "No op https://puppet-compiler.wmflabs.org/7027/stat1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/364727 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [14:09:13] (03PS4) 10Giuseppe Lavagetto: role::scb: add recommendation-api service [puppet] - 10https://gerrit.wikimedia.org/r/364451 (https://phabricator.wikimedia.org/T165760) [14:11:13] PROBLEM - Confd template for /srv/config-master/pybal/codfw/restbase on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:11:13] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/dns_rec_udp on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:11:13] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/logstash-log4j on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:11:13] PROBLEM - Confd template for /srv/config-master/pybal/codfw/wdqs on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:11:13] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/pdfrender on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:11:14] PROBLEM - Confd template for /srv/config-master/pybal/esams/text-https on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:11:14] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/logstash-json-udp on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:11:15] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/dns_rec on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:11:15] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/parsoid on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:11:16] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/thumbor on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:11:16] PROBLEM - Confd template for /srv/config-master/pybal/codfw/upload-https on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:11:17] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/text-https on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:11:17] PROBLEM - Confd template for /srv/config-master/pybal/codfw/apertium on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:11:18] PROBLEM - Confd template for /srv/config-master/pybal/codfw/kartotherian on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:11:33] PROBLEM - swift-object-replicator on ms-be1036 is CRITICAL: Return code of 255 is out of bounds [14:11:43] PROBLEM - swift-container-auditor on ms-be1036 is CRITICAL: Return code of 255 is out of bounds [14:11:43] PROBLEM - swift-container-replicator on ms-be1036 is CRITICAL: Return code of 255 is out of bounds [14:11:43] PROBLEM - swift-object-server on ms-be1036 is CRITICAL: Return code of 255 is out of bounds [14:11:43] PROBLEM - nutcracker process on mw1260 is CRITICAL: Return code of 255 is out of bounds [14:11:43] PROBLEM - MD RAID on ms-be1036 is CRITICAL: Return code of 255 is out of bounds [14:11:43] PROBLEM - Confd template for /srv/config-master/pybal/codfw/search-https on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:11:44] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/apaches on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:11:44] PROBLEM - swift-account-reaper on ms-be1036 is CRITICAL: Return code of 255 is out of bounds [14:11:45] PROBLEM - swift-container-server on ms-be1036 is CRITICAL: Return code of 255 is out of bounds [14:11:45] PROBLEM - Confd template for /srv/config-master/pybal/codfw/misc_web on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:11:46] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/rendering on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:11:46] PROBLEM - Confd template for /srv/config-master/pybal/codfw/appservers-https on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:11:47] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/eventstreams on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:11:47] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/logstash-syslog-udp on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:11:58] ugh, ms-be1036 is being wmf-auto-reimage'd [14:12:03] PROBLEM - Confd template for /srv/config-master/pybal/codfw/api on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:12:03] PROBLEM - Confd template for /srv/config-master/pybal/codfw/kubemaster on labtestpuppetmaster2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:12:23] RECOVERY - very high load average likely xfs on ms-be1036 is OK: OK - load average: 3.06, 1.03, 0.38 [14:12:24] RECOVERY - DPKG on ms-be1036 is OK: All packages OK [14:12:24] RECOVERY - Check size of conntrack table on ms-be1036 is OK: OK: nf_conntrack is 0 % full [14:12:33] RECOVERY - swift-object-updater on ms-be1036 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [14:12:33] RECOVERY - swift-object-replicator on ms-be1036 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [14:12:43] RECOVERY - swift-container-auditor on ms-be1036 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:12:43] RECOVERY - swift-object-server on ms-be1036 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [14:12:43] RECOVERY - swift-container-replicator on ms-be1036 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [14:12:44] RECOVERY - MD RAID on ms-be1036 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [14:12:44] RECOVERY - swift-account-reaper on ms-be1036 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [14:12:44] RECOVERY - swift-container-server on ms-be1036 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [14:12:53] RECOVERY - swift-account-auditor on ms-be1036 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [14:12:53] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: Return code of 255 is out of bounds [14:13:03] RECOVERY - swift-object-auditor on ms-be1036 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [14:13:04] RECOVERY - swift-account-server on ms-be1036 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [14:13:04] RECOVERY - configured eth on ms-be1036 is OK: OK - interfaces up [14:13:04] RECOVERY - Disk space on ms-be1036 is OK: DISK OK [14:13:13] RECOVERY - swift-account-replicator on ms-be1036 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [14:13:37] (03CR) 10Giuseppe Lavagetto: [C: 031] "https://puppet-compiler.wmflabs.org/7028/" [puppet] - 10https://gerrit.wikimedia.org/r/364451 (https://phabricator.wikimedia.org/T165760) (owner: 10Giuseppe Lavagetto) [14:14:00] (03PS5) 10Giuseppe Lavagetto: recommendation api: refactor profile, remove module [puppet] - 10https://gerrit.wikimedia.org/r/364221 
(https://phabricator.wikimedia.org/T148129) [14:14:38] !log Disable icinga notifications and event handler checks for labsdb1005 [14:14:46] (03CR) 10Giuseppe Lavagetto: [C: 032] recommendation api: refactor profile, remove module [puppet] - 10https://gerrit.wikimedia.org/r/364221 (https://phabricator.wikimedia.org/T148129) (owner: 10Giuseppe Lavagetto) [14:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:21] thanks madhuvishy for the handler too!! ;) [14:16:53] PROBLEM - salt-minion processes on stat1006 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [14:17:00] (03CR) 10Addshore: Add Newsletter to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362393 (https://phabricator.wikimedia.org/T110170) (owner: 10Addshore) [14:17:01] volans: :) [14:17:07] ok, I will reboot now [14:17:17] jynus: okay I've disabled notifications for labsdb1005 and toolschecker [14:17:19] anything else? [14:17:40] no, we are good [14:17:43] PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:17:51] !log restarting labsdb1005 (toolsdb) [14:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:28] it may take some time for mysql to go down [14:19:00] jynus: okay, I'm at the console [14:19:08] (03PS2) 10Addshore: Add Newsletter to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362393 (https://phabricator.wikimedia.org/T110170) [14:19:23] rebooting now [14:19:36] 10Operations, 10Performance-Team, 10Thumbor: Write graceful rolling restart script for Thumbor - https://phabricator.wikimedia.org/T162875#3430665 (10Gilles) 05Open>03Invalid We can and should just depool a server before restarting all its Thumbor instances. 
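Around the labsdb1005 restart above, the worry (spelled out just after) is that Aria/MyISAM user tables may already be corrupted, independent of the reboot. One way to sweep for that is CHECK TABLE over every table on the non-crash-safe engines; connection details here are placeholders, and on a shared instance like toolsdb such a sweep is slow and I/O-heavy, so it should be run deliberately:

```python
#!/usr/bin/env python
# Sketch: CHECK TABLE across all MyISAM/Aria tables on an instance.
# Credentials are placeholders; this adds real load, run with care.
import pymysql

conn = pymysql.connect(host='localhost', user='root', password='...')
cur = conn.cursor()
cur.execute("""
    SELECT table_schema, table_name
      FROM information_schema.tables
     WHERE engine IN ('MyISAM', 'Aria')
       AND table_schema NOT IN ('information_schema', 'performance_schema')
""")
for schema, table in cur.fetchall():
    check = conn.cursor()
    check.execute("CHECK TABLE `%s`.`%s` MEDIUM" % (schema, table))
    # Result rows are (Table, Op, Msg_type, Msg_text).
    for _tbl, _op, msg_type, msg_text in check.fetchall():
        if msg_type in ('error', 'warning'):
            print("%s.%s: %s: %s" % (schema, table, msg_type, msg_text))
```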
[14:21:33] PROBLEM - mediawiki-installation DSH group on mw1260 is CRITICAL: Host mw1260 is not in mediawiki-installation dsh group [14:21:43] (03PS2) 10Addshore: Enable Newsletter on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362394 (https://phabricator.wikimedia.org/T110170) [14:21:45] (03PS1) 10Addshore: Enable Newsletter on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364734 (https://phabricator.wikimedia.org/T110170) [14:21:53] (03CR) 10jerkins-bot: [V: 04-1] Enable Newsletter on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362394 (https://phabricator.wikimedia.org/T110170) (owner: 10Addshore) [14:21:57] (03CR) 10jerkins-bot: [V: 04-1] Enable Newsletter on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364734 (https://phabricator.wikimedia.org/T110170) (owner: 10Addshore) [14:21:57] jynus: it's up [14:22:11] (03PS3) 10Addshore: Enable Newsletter on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362394 (https://phabricator.wikimedia.org/T110170) [14:22:15] (03PS2) 10Addshore: Enable Newsletter on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364734 (https://phabricator.wikimedia.org/T110170) [14:22:43] service is up [14:22:48] but I am running mysql_upgrade [14:22:53] jynus: okay [14:23:22] with aria/myisam tables, we cannot guarantee there are no corrupted user tables [14:23:24] (03CR) 10Addshore: [C: 032] Add Newsletter to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362393 (https://phabricator.wikimedia.org/T110170) (owner: 10Addshore) [14:23:41] not because of the reboot, but because they were already there [14:24:04] jynus: user tables as in, host specific ones? [14:24:22] (03Merged) 10jenkins-bot: Add Newsletter to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362393 (https://phabricator.wikimedia.org/T110170) (owner: 10Addshore) [14:24:35] user-created tables [14:25:39] everything looks good [14:26:05] jynus: awesome, verifying on tools end [14:26:11] (03CR) 10jenkins-bot: Add Newsletter to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362393 (https://phabricator.wikimedia.org/T110170) (owner: 10Addshore) [14:26:23] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1260 is CRITICAL: Return code of 255 is out of bounds [14:26:33] RECOVERY - puppet last run on ms-be1036 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [14:26:57] labsdb1004 replicating with no problem [14:27:04] we are now on 10.0.31 on both servers [14:27:09] !log addshore@tin Synchronized wmf-config/extension-list: [[gerrit:362393|Add Newsletter to extension-list]] PT1/2 (duration: 00m 47s) [14:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:00] performance may be degraded for some time, but probably nothing too worrying [14:28:08] !log addshore@tin Synchronized wmf-config/extension-list-labs: [[gerrit:362393|Add Newsletter to extension-list]] PT1/2 (duration: 00m 46s) [14:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:01] jynus: I can confirm i can access it fine from tools [14:31:33] PROBLEM - Host labtestpuppetmaster2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:31:45] (03CR) 10Faidon Liambotis: [C: 031] WIP: Monitoring: Simplify BMC hostnames [puppet] - 10https://gerrit.wikimedia.org/r/364708 (owner: 10Alexandros Kosiaris) [14:31:54] RECOVERY - Confd template for
/srv/config-master/pybal/eqiad/swift-https on labtestpuppetmaster2001 is OK: No errors detected [14:32:03] RECOVERY - Host labtestpuppetmaster2001 is UP: PING OK - Packet loss = 0%, RTA = 36.30 ms [14:32:04] RECOVERY - Confd template for /srv/config-master/pybal/codfw/parsoid on labtestpuppetmaster2001 is OK: No errors detected [14:32:04] RECOVERY - Confd template for /srv/config-master/pybal/codfw/dns_rec_udp on labtestpuppetmaster2001 is OK: No errors detected [14:32:04] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/apaches on labtestpuppetmaster2001 is OK: No errors detected [14:32:04] RECOVERY - puppetmaster https on labtestpuppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 331 bytes in 0.208 second response time [14:32:04] RECOVERY - Confd template for /srv/config-master/pybal/codfw/search-https on labtestpuppetmaster2001 is OK: No errors detected [14:32:04] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/misc_web-https on labtestpuppetmaster2001 is OK: No errors detected [14:32:05] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/appservers-https on labtestpuppetmaster2001 is OK: No errors detected [14:32:05] RECOVERY - Confd template for /srv/config-master/pybal/esams/dns_rec on labtestpuppetmaster2001 is OK: No errors detected [14:32:06] RECOVERY - Confd template for /srv/config-master/pybal/ulsfo/upload-https on labtestpuppetmaster2001 is OK: No errors detected [14:32:06] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/kubemaster on labtestpuppetmaster2001 is OK: No errors detected [14:32:07] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/kibana on labtestpuppetmaster2001 is OK: No errors detected [14:32:07] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/search-https on labtestpuppetmaster2001 is OK: No errors detected [14:32:08] RECOVERY - Confd template for /srv/config-master/pybal/codfw/text-https on labtestpuppetmaster2001 is OK: No errors detected [14:32:19] RECOVERY - Confd template for /srv/config-master/pybal/codfw/dns_rec on labtestpuppetmaster2001 is OK: No errors detected [14:32:19] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/api-https on labtestpuppetmaster2001 is OK: No errors detected [14:32:20] RECOVERY - Confd template for /srv/config-master/pybal/codfw/zotero on labtestpuppetmaster2001 is OK: No errors detected [14:32:20] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/zotero on labtestpuppetmaster2001 is OK: No errors detected [14:32:21] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/prometheus on labtestpuppetmaster2001 is OK: No errors detected [14:32:21] RECOVERY - Confd template for /srv/config-master/pybal/codfw/search on labtestpuppetmaster2001 is OK: No errors detected [14:32:23] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/trendingedits on labtestpuppetmaster2001 is OK: No errors detected [14:32:23] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/eventbus on labtestpuppetmaster2001 is OK: No errors detected [14:32:23] RECOVERY - Confd template for /srv/config-master/pybal/codfw/mathoid on labtestpuppetmaster2001 is OK: No errors detected [14:32:23] RECOVERY - Confd template for /srv/config-master/pybal/esams/upload on labtestpuppetmaster2001 is OK: No errors detected [14:32:24] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/misc_web on labtestpuppetmaster2001 is OK: No errors detected [14:32:24] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/logstash-syslog-tcp 
on labtestpuppetmaster2001 is OK: No errors detected [14:32:28] <_joe_> what? [14:32:33] RECOVERY - Confd template for /srv/config-master/discovery/services.yaml on labtestpuppetmaster2001 is OK: No errors detected [14:32:33] RECOVERY - Confd template for /srv/config-master/pybal/codfw/eventstreams on labtestpuppetmaster2001 is OK: No errors detected [14:32:33] RECOVERY - Confd template for /srv/config-master/pybal/codfw/prometheus on labtestpuppetmaster2001 is OK: No errors detected [14:32:33] RECOVERY - Confd template for /srv/config-master/pybal/codfw/eventbus on labtestpuppetmaster2001 is OK: No errors detected [14:32:33] RECOVERY - Confd template for /srv/config-master/pybal/codfw/graphoid on labtestpuppetmaster2001 is OK: No errors detected [14:32:33] RECOVERY - Confd template for /srv/config-master/pybal/codfw/trendingedits on labtestpuppetmaster2001 is OK: No errors detected [14:32:33] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/logstash-gelf on labtestpuppetmaster2001 is OK: No errors detected [14:32:34] RECOVERY - Confd template for /srv/config-master/pybal/codfw/upload on labtestpuppetmaster2001 is OK: No errors detected [14:32:34] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/citoid on labtestpuppetmaster2001 is OK: No errors detected [14:32:35] RECOVERY - Check size of conntrack table on labtestpuppetmaster2001 is OK: OK: nf_conntrack is 0 % full [14:32:35] RECOVERY - Confd template for /srv/config-master/pybal/codfw/rendering on labtestpuppetmaster2001 is OK: No errors detected [14:32:36] RECOVERY - Confd template for /srv/config-master/pybal/esams/misc_web on labtestpuppetmaster2001 is OK: No errors detected [14:32:40] andrewbogott: how many checks there are for labtestpuppetmaster2001??? [14:33:03] ok, apparently I really don't know how to downtime a server [14:33:06] what am I doing wrong? [14:33:31] a lot of those checks seem really odd to me, so...two issues [14:33:38] one check per file? :D [14:33:49] <_joe_> volans: chasemp wrote that check [14:33:58] <_joe_> and yes, it's one per file, it's expected. [14:34:16] why is confd interested in labtestpuppetmaster? [14:34:20] <_joe_> we might want not to have that on labs puppet masters [14:34:28] I'm selecting labtestpuppetmaster2001 in icinga UI and selecting 'downtime host and all services' [14:34:33] <_joe_> chasemp: I guess it came as part of profile::puppetmaster::frontend [14:34:42] chasemp: confd is on purpose, I'm trying to reuse the standard puppetmaster classes. [14:35:14] Since we're going to have two, labpuppetmaster1001 and 1002, I'm trying to make it similar to prod puppetmaster1001/1002 [14:35:22] thanks jynus :) Looks good from my end! [14:35:25] <_joe_> andrewbogott: we might want to refactor that in a separate profile [14:35:26] at this scale it's probably reasonable to lint all and then throw a % fail warning heh [14:35:31] But why can't I make icinga SHUT UP about it? [14:35:35] andrewbogott: ok I gotcha [14:35:55] _joe_: can you tell me more about why I would or wouldn't want the confd parts? 
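For context on what those dozens of 'Confd template for …' checks verify (and on _joe_'s answer just below): confd watches etcd and renders one pybal pool file per service under /srv/config-master, and each check flags a template that failed to render. Roughly the render step, sketched against etcd's v2 HTTP API, with the endpoint, key layout and output format all assumed for illustration:

```python
#!/usr/bin/env python
# Sketch of one confd render: read a service's pool members from etcd and
# write the corresponding pybal file. Endpoint, key prefix and the exact
# output format are assumptions for illustration, not the real template.
import json
from urllib.request import urlopen

ETCD = 'http://etcd.example.wmnet:2379'        # assumed endpoint
KEY = '/v2/keys/conftool/v1/pybal/eqiad/text'  # assumed key layout

reply = json.load(urlopen(ETCD + KEY + '?recursive=true'))
lines = []
for node in reply['node'].get('nodes', []):
    host = node['key'].rsplit('/', 1)[-1]
    value = json.loads(node['value'])  # e.g. {"pooled": "yes", "weight": 10}
    lines.append("{ 'host': '%s', 'weight': %d, 'enabled': %s }"
                 % (host, value['weight'], value['pooled'] == 'yes'))
with open('/srv/config-master/pybal/eqiad/text', 'w') as out:
    out.write('\n'.join(lines) + '\n')
```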
[14:36:08] (03PS1) 10Ottomata: Add statistics private and cruncher profiles, with a little refactoring [puppet] - 10https://gerrit.wikimedia.org/r/364741 (https://phabricator.wikimedia.org/T152712) [14:36:21] <_joe_> andrewbogott: because those are just to show values in etcd via config-master.wikimedia.org [14:36:25] (03PS2) 10Ottomata: Add statistics private and cruncher profiles, with a little refactoring [puppet] - 10https://gerrit.wikimedia.org/r/364741 (https://phabricator.wikimedia.org/T152712) [14:36:36] <_joe_> and you're not interested in that on a labs puppetmaster frontend [14:36:42] _joe_: oh, nothing about coordinating between the frontend and backend? [14:36:56] <_joe_> nope [14:37:43] (03CR) 10jerkins-bot: [V: 04-1] Add statistics private and cruncher profiles, with a little refactoring [puppet] - 10https://gerrit.wikimedia.org/r/364741 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [14:37:54] !log installing apache security updates on californium / horizon.wikimedia.org [14:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:12] moritzm: that's also toolsadmin.wikimedia.org just fyi [14:38:24] I'm still pretty worried about how apparently I don't know how to downtime a host despite having done it 1000 times [14:38:46] andrewbogott: recoveries always come even if in downtime if the failure happened outside of a downtime iirc [14:39:05] the fact that it was a ton just made it messy [14:39:10] that's not what happened… I downtimed it for 4 years [14:39:15] and then an hour later I rebooted the host [14:39:16] oh, yeah [14:39:24] I assumed it was the icinga forgets downtimes issue [14:39:35] if so it forgot it twice in one day [14:39:35] <_joe_> chasemp: you are correct sir [14:39:38] (03PS3) 10Ottomata: Add statistics private, cruncher and web profiles, with a little refactoring [puppet] - 10https://gerrit.wikimedia.org/r/364741 (https://phabricator.wikimedia.org/T152712) [14:39:42] 8 [14:39:42] since I already downtimed it yesterday for three years [14:40:05] and then in response to an alert storm today when I was working on puppet I downtimed it again [14:40:42] andrewbogott: i don't want to alarm you, but you are dreaming. go turn off the stove. also it's 2019 already. [14:40:58] your house is on fire !
[14:41:06] was for the last 2 years at least [14:41:08] :P [14:41:11] (03PS2) 10Alexandros Kosiaris: Monitoring: Simplify BMC hostnames [puppet] - 10https://gerrit.wikimedia.org/r/364708 [14:41:15] * andrewbogott sighs and goes back to his original problem [14:41:40] but yes we seem to have a problem with icinga losing downtimes and we can't yet pinpoint it [14:41:52] we can't even reproduce it [14:46:46] 3 emails from you personally to foerderbarometer@wikimedia.de [14:47:01] strike that [14:47:03] (03PS1) 10Elukey: eventlogging_cleaner.py: configure configparser [puppet] - 10https://gerrit.wikimedia.org/r/364743 [14:47:42] moritzm: all seems well w/ californium afaict [14:48:21] yep, all fine [14:49:05] (03PS1) 10Jcrespo: mariadb-multiinstance: Make the main multisource 3306 instance available [puppet] - 10https://gerrit.wikimedia.org/r/364744 (https://phabricator.wikimedia.org/T169510) [14:49:46] (03CR) 10Elukey: [C: 032] eventlogging_cleaner.py: configure configparser [puppet] - 10https://gerrit.wikimedia.org/r/364743 (owner: 10Elukey) [14:49:54] (03PS1) 10Andrew Bogott: Add ipv6 to labtestpuppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/364745 [14:50:33] (03PS2) 10Jcrespo: mariadb-multiinstance: Make the main multisource 3306 instance available [puppet] - 10https://gerrit.wikimedia.org/r/364744 (https://phabricator.wikimedia.org/T169510) [14:51:36] (03PS2) 10Andrew Bogott: Add ipv6 to labtestpuppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/364745 [14:53:01] (03CR) 10Andrew Bogott: [C: 032] Add ipv6 to labtestpuppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/364745 (owner: 10Andrew Bogott) [14:54:41] (03PS3) 10Jcrespo: mariadb-multiinstance: Make the main multisource 3306 instance available [puppet] - 10https://gerrit.wikimedia.org/r/364744 (https://phabricator.wikimedia.org/T169510) [14:55:20] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10netops: codfw: rack frack refresh equipment - https://phabricator.wikimedia.org/T169643#3430769 (10ayounsi) Thank you! For the naming, please use: pfw3a-codfw and pfw3b-codfw for the firewalls and fasw-c8a-codfw and fasw-c8b-codfw for the switch members... [14:55:53] !log Run redact_sanitarium on db1069 and db1095 for maiwikimedia - T169510 [14:56:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:03] T169510: Setup dbstore2002 with 2 new mysql instances from production and enable GTID - https://phabricator.wikimedia.org/T169510 [14:56:08] gah! [14:56:12] wrong ticket [14:56:50] !log Run redact_sanitarium on db1069 and db1095 for maiwikimedia - T168788 [14:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:01] T168788: Prepare and check storage layer for maiwikimedia - https://phabricator.wikimedia.org/T168788 [14:57:05] marostegui: Just to let you know, is it okay that I can't access dinwiki at labs yet? [14:57:08] <_joe_> icinga-wm is not here [14:57:16] (03PS4) 10Ottomata: Add statistics private, cruncher and web profiles, with a little refactoring [puppet] - 10https://gerrit.wikimedia.org/r/364741 (https://phabricator.wikimedia.org/T152712) [14:57:17] <_joe_> but someone broke puppet [14:57:19] Urbanecm: yes it is :) [14:57:22] <_joe_> can anyone check? [14:57:25] <_joe_> I'm on the phone [14:57:42] 10Operations, 10ops-codfw, 10monitoring: mw2201, mw2202 - contact Dell and replace main board - https://phabricator.wikimedia.org/T170307#3430800 (10Papaul) Hi Papaul, Thank you for contacting the Dell Enterprise Technical Support team.
We have opened up a new case # 950882366 for issues reported with –... [14:58:12] <_joe_> Sleeping 35 for random splay [14:58:12] <_joe_> Info: Sleeping for 1793 seconds (splay is enabled) [14:58:13] marostegui: Okay, thank you [14:58:32] <_joe_> bblack: ^^ [14:58:41] <_joe_> bblack: --no-splay [14:58:55] (03CR) 10Alexandros Kosiaris: [C: 032] Monitoring: Simplify BMC hostnames [puppet] - 10https://gerrit.wikimedia.org/r/364708 (owner: 10Alexandros Kosiaris) [14:58:59] (03PS3) 10Alexandros Kosiaris: Monitoring: Simplify BMC hostnames [puppet] - 10https://gerrit.wikimedia.org/r/364708 [14:59:02] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Monitoring: Simplify BMC hostnames [puppet] - 10https://gerrit.wikimedia.org/r/364708 (owner: 10Alexandros Kosiaris) [15:00:05] <_joe_> akosiaris: are you on einsteinium? [15:00:11] yes [15:00:12] <_joe_> can you restart icinga-wm? [15:00:16] yeah [15:00:22] (03PS5) 10Ottomata: Add statistics private, cruncher and web profiles, with a little refactoring [puppet] - 10https://gerrit.wikimedia.org/r/364741 (https://phabricator.wikimedia.org/T152712) [15:00:53] ah it got disconnected [15:01:08] and then ofc it does not try to reconnect [15:04:03] PROBLEM - puppet last run on ms-be2026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:04:03] (03PS1) 10Giuseppe Lavagetto: puppet: add --no-splay [puppet] - 10https://gerrit.wikimedia.org/r/364748 [15:04:12] RECOVERY - puppet last run on lvs2006 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [15:04:39] _joe_: docs report that the default is false for splay :/ [15:04:42] PROBLEM - puppet last run on mw2153 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:04:47] (03CR) 10Filippo Giunchedi: [C: 031] Remove sshd options specific to SSH protocol 1 [puppet] - 10https://gerrit.wikimedia.org/r/364682 (https://phabricator.wikimedia.org/T170298) (owner: 10Muehlenhoff) [15:04:49] (03PS6) 10Ottomata: Add statistics private, cruncher and web profiles, with a little refactoring [puppet] - 10https://gerrit.wikimedia.org/r/364741 (https://phabricator.wikimedia.org/T152712) [15:04:59] <_joe_> volans: which docs? [15:05:03] PROBLEM - puppet last run on db1092 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:05:08] <_joe_> sigh [15:05:26] <_joe_> volans: we added it to run-puppet-agent, if you remember [15:05:26] _joe_: https://docs.puppet.com/puppet/3.8/configuration.html#splay [15:05:39] yes a sleep in seconds between 0 and 60 [15:05:43] PROBLEM - puppet last run on mw2203 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:05:44] (03PS2) 10Faidon Liambotis: icinga: merge routers/switches monitoring groups [puppet] - 10https://gerrit.wikimedia.org/r/364206 (https://phabricator.wikimedia.org/T167279) [15:05:48] (03CR) 10Faidon Liambotis: [C: 032] icinga: merge routers/switches monitoring groups [puppet] - 10https://gerrit.wikimedia.org/r/364206 (https://phabricator.wikimedia.org/T167279) (owner: 10Faidon Liambotis) [15:05:54] * volans reading backlog [15:06:13] RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:06:13] RECOVERY - puppet last run on analytics1041 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [15:06:14] <_joe_> volans: nope, I think it's sleeping between 0 and 3600 [15:06:22] RECOVERY - puppet last run on scb2005 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [15:06:22] <_joe_> Info: Sleeping for 1793 seconds (splay is enabled) [15:06:42] RECOVERY - puppet last run on mw2153 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:06:53] <_joe_> volans: I'll wait for your review [15:06:55] _joe_: splay = true [15:06:57] in the config [15:07:01] !log lvs2006: upgrade pybal to 1.13.7 T82747 T154759 [15:07:05] in the [agent] section [15:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:12] T82747: pybal health checks are ipv4 even for ipv6 vips - https://phabricator.wikimedia.org/T82747 [15:07:12] T154759: Pybal not happy with DNS delays - https://phabricator.wikimedia.org/T154759 [15:07:15] <_joe_> so you think we should remove that? [15:07:18] yeah! [15:07:32] <_joe_> actually it makes sense if the agent runs continuously [15:07:32] PROBLEM - puppet last run on db2061 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:07:43] RECOVERY - puppet last run on elastic1033 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [15:07:43] <_joe_> volans: fact is, that would change behaviour [15:07:47] <_joe_> what I did doesn't [15:07:52] PROBLEM - puppet last run on rdb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:07:57] ok, let's go with that first [15:08:01] <_joe_> so I'd say let's first fix the current issue with --no-splay [15:08:06] <_joe_> then look into what would change [15:08:09] but we might need to have cumin change the file [15:08:12] PROBLEM - puppet last run on notebook1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:08:17] to avoid waiting for the real runs [15:08:20] ok [15:08:23] <_joe_> volans: no *eventually* puppet runs [15:08:43] <_joe_> if we want to force it, we can't use run-puppet-agent either [15:08:49] (03CR) 10Volans: [C: 031] "+1 we have splay = true in the config file" [puppet] - 10https://gerrit.wikimedia.org/r/364748 (owner: 10Giuseppe Lavagetto) [15:09:12] PROBLEM - puppet last run on dbproxy1006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:09:15] (03PS2) 10Giuseppe Lavagetto: puppet: add --no-splay [puppet] - 10https://gerrit.wikimedia.org/r/364748 [15:09:22] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] puppet: add --no-splay [puppet] - 10https://gerrit.wikimedia.org/r/364748 (owner: 10Giuseppe Lavagetto) [15:09:22] PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:09:26] _joe_: I was thinking cumin adding the --no-splay to the puppet-run [15:09:32] (03Abandoned) 10Elukey: Temporary depool ulsfo for network issues [dns] - 10https://gerrit.wikimedia.org/r/363598 (owner: 10Elukey) [15:10:01] but yeah eventually it will run... in the next 2h [15:10:03] (03Abandoned) 10Elukey: Add white-list for EventLogging auto-purging [puppet] - 10https://gerrit.wikimedia.org/r/298721 (https://phabricator.wikimedia.org/T108850) (owner: 10Mforns) [15:10:20] sorry :) [15:10:49] I assumed removing the explicit options would default to no-splay, apparently not! [15:11:02] PROBLEM - puppet last run on mw1283 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:11:12] RECOVERY - puppet last run on mw2223 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [15:11:13] bblack: the default yes, but we have splay = true in the puppet.conf [agent] section [15:11:13] so we can still do a manual run of puppet everywhere, it just needs to be manual with the right arguments via cumin [15:11:17] <_joe_> bblack: it fooled all of us [15:11:20] <_joe_> bblack: yes [15:11:41] <_joe_> let me test this fixed the issue [15:11:43] PROBLEM - puppet last run on mw1297 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:11:47] or have cumin add the --no-splay to the file itself :D [15:12:30] 10Operations, 10Performance-Team, 10Thumbor, 10Patch-For-Review: Implement poolcounter failover in Thumbor - https://phabricator.wikimedia.org/T169312#3430873 (10Gilles) [15:13:26] I can patch the puppet.conf too, but --no-splay in the file as well is safer [15:13:54] yeah joe was unsure of the consequences of changing the conf, so we can evaluate that later without hurry [15:14:02] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [15:14:12] PROBLEM - puppet last run on wtp1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:14:17] <_joe_> we can use run-puppet-agent [15:14:44] <_joe_> actually no, it will wait for the long-splayed thing to run? [15:14:55] <_joe_> oh jeez, thank you puppetlabz [15:15:05] I don't think so, because puppet doesn't acquire locks until after the splay sleep, right? [15:15:11] _joe_: it will work [15:15:14] <_joe_> oh right [15:15:17] (which was the original observation leading to the patch) [15:15:29] <_joe_> volans: didn't you create a script to re-run puppet only if it failed previously [15:15:32] <_joe_> ? [15:15:33] RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [15:15:33] you can use the https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed [15:15:52] PROBLEM - puppet last run on analytics1052 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:15:57] _joe_: ^^^ [15:16:16] <_joe_> volans: those are remnants [15:16:24] the link [15:16:29] <_joe_> they will keep happening until puppet ran everywhere [15:16:32] for the command [15:16:43] <_joe_> oh yeah I was thinking we need puppet to run everywhere [15:16:56] start with the failed-only first ;) [15:17:03] PROBLEM - puppet last run on db2068 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:17:19] <_joe_> ok, it won't do so much good, still - doing [15:17:41] why? [15:18:13] PROBLEM - puppet last run on mw2147 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:18:21] <_joe_> because other hosts will keep failing for ~ 1 hour [15:18:23] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors [15:18:32] RECOVERY - puppet last run on notebook1002 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [15:18:37] ACKNOWLEDGEMENT - DNS acamar.mgmt.codfw.wmnet on acamar.mgmt.codfw.wmnet is CRITICAL: DNS CRITICAL - No response from DNS 10.193.1.30 alexandros kosiaris still running puppet [15:18:37] ACKNOWLEDGEMENT - DNS achernar.mgmt.codfw.wmnet on achernar.mgmt.codfw.wmnet is CRITICAL: DNS CRITICAL - No response from DNS 10.193.1.31 alexandros kosiaris still running puppet [15:18:37] ACKNOWLEDGEMENT - DNS analytics1001.mgmt.eqiad.wmnet on analytics1001.mgmt.eqiad.wmnet is CRITICAL: DNS CRITICAL - No response from DNS 10.65.5.180 alexandros kosiaris still running puppet [15:18:37] ACKNOWLEDGEMENT - DNS analytics1002.mgmt.eqiad.wmnet on analytics1002.mgmt.eqiad.wmnet is CRITICAL: DNS CRITICAL - No response from DNS 10.65.5.184 alexandros kosiaris still running puppet [15:18:37] ACKNOWLEDGEMENT - DNS analytics1003.mgmt.eqiad.wmnet on analytics1003.mgmt.eqiad.wmnet is CRITICAL: DNS CRITICAL - No response from DNS 10.65.5.181 alexandros kosiaris still running puppet [15:18:37] <_joe_> icinga? [15:18:42] <_joe_> what's up with icinga? [15:18:56] <_joe_> sigh [15:19:06] <_joe_> icinga-wm is down again [15:19:12] excess flood [15:19:12] better ;-) [15:19:19] race condition is as usual the bug btw [15:19:20] (03PS1) 10Gilles: Upgrade to 1.0 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/364752 (https://phabricator.wikimedia.org/T169312) [15:19:25] should be fixed in the next run [15:19:29] icringa [15:19:38] <_joe_> akosiaris: can you restart the bot again? :P [15:19:53] <_joe_> thanks [15:19:57] I did not have to [15:20:13] RECOVERY - puppet last run on elastic2025 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [15:20:13] _joe_: it was kicked for excess flooding [15:20:15] turns out last time it did not reconnect because it croaked on an ascii conversion [15:20:22] RECOVERY - puppet last run on rdb1003 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [15:20:22] PROBLEM - puppet last run on mc1030 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:20:23] RECOVERY - puppet last run on mw2147 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:20:31] (03PS1) 10Faidon Liambotis: (WIP): Add SNMP classes [puppet] - 10https://gerrit.wikimedia.org/r/364753 [15:20:32] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [15:20:32] RECOVERY - puppet last run on mw1283 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [15:20:42] RECOVERY - puppet last run on wtp1046 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [15:20:43] RECOVERY - puppet last run on ms-be2026 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [15:20:47] <_joe_> akosiaris: ahaha [15:20:49] nooooooooooooooooooooo, not snmp!!!! [15:21:01] I did my best to get rid of it from puppet [15:21:03] RECOVERY - puppet last run on rdb1002 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [15:21:12] RECOVERY - puppet last run on db2061 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [15:21:12] RECOVERY - puppet last run on mw1203 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [15:21:22] RECOVERY - puppet last run on db2068 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [15:21:40] jouncebot: next [15:21:40] In 1 hour(s) and 38 minute(s): Special (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170712T1700) [15:22:02] RECOVERY - puppet last run on mw1273 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:22:12] PROBLEM - puppet last run on wtp2010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:22:18] _joe_: how are the puppet re-runs going ? [15:22:22] RECOVERY - puppet last run on mc1030 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:22:32] I'm going to try to upgrade the Toolforge Elasticsearch cluster again. That will knock stashbot offline. [15:22:42] RECOVERY - puppet last run on db1092 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:22:43] will that freak anyone here out? [15:22:50] <_joe_> akosiaris: well it seems [15:23:21] my bouncer will have !log scrollback and I'll fix wikitech by hand when the upgrade is over [15:23:30] RECOVERY - puppet last run on aqs1006 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:23:44] bd808: ETA for the downtime? [15:23:48] * akosiaris prepares a for I in range(1,10^6) random !log lines [15:23:50] (03CR) 10Jcrespo: [C: 032] mariadb-multiinstance: Make the main multisource 3306 instance available [puppet] - 10https://gerrit.wikimedia.org/r/364744 (https://phabricator.wikimedia.org/T169510) (owner: 10Jcrespo) [15:23:56] (03PS4) 10Jcrespo: mariadb-multiinstance: Make the main multisource 3306 instance available [puppet] - 10https://gerrit.wikimedia.org/r/364744 (https://phabricator.wikimedia.org/T169510) [15:24:09] (03PS1) 10Urbanecm: Fix logos for srwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364754 (https://phabricator.wikimedia.org/T168444) [15:24:10] PROBLEM - puppet last run on db2072 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:24:20] PROBLEM - puppet last run on db2050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:24:21] volans: if all goes well, 15m. If not 30m to give up and roll back [15:24:34] _joe_: ok cause I am monitoring the BMC renaming status on https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?servicegroup=mgmt&style=overview&servicestatustypes=29 [15:24:42] looks like it's indeed finally updating [15:24:53] I'm force-running icinga [15:24:59] (for my change) [15:24:59] ? [15:25:04] puppet you mean ? [15:25:05] force-running puppet on icinga hosts [15:25:07] sorry :) [15:25:07] lol [15:25:08] ok [15:25:16] but icinga reload fails, I'm not sure if it's transient yet [15:25:21] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2003078 [15:25:40] RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:25:55] Could not find any host matching 'mw1180.mgmt' [15:25:56] bd808: if you aren't on a tight schedule maybe wait until this mess is over ;) [15:25:58] this is transient ^ [15:26:30] RECOVERY - puppet last run on mw1297 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [15:26:36] volans: *nod* I think I'll just push to tomorrow and do something else this morning :) [15:26:52] * bd808 has plenty of targets to choose from [15:26:58] (03CR) 10Giuseppe Lavagetto: [C: 032] Add entries for service recommendation-api [dns] - 10https://gerrit.wikimedia.org/r/364457 (https://phabricator.wikimedia.org/T165760) (owner: 10Giuseppe Lavagetto) [15:27:18] (03PS7) 10Ottomata: Add statistics private, cruncher and web profiles, with a little refactoring [puppet] - 10https://gerrit.wikimedia.org/r/364741 (https://phabricator.wikimedia.org/T152712) [15:28:10] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:28:27] _joe_: I'm still not getting why so many were failing. The default for splaylimit is $runinterval and its default is 30m. Our timeout in puppet-run is 1800 (defaults to s) so it should have mostly run, with longer splay but not killed by the timeout [15:28:50] <_joe_> volans: subsequent runs piling up on each other I think [15:28:57] <_joe_> so races upon races [15:29:01] 10Operations, 10ops-eqiad, 10netops: eqiad: rack frack refresh equipment - https://phabricator.wikimedia.org/T169644#3430940 (10ayounsi) For the naming, please use: pfw3a-eqiad and pfw3b-eqiad for the firewalls, and fasw-c1a-eqiad and fasw-c1b-eqiad for the switches Then for the cabling, please follow: {F87... [15:29:03] 10Operations, 10ops-eqiad, 10netops: eqiad: rack frack refresh equipment - https://phabricator.wikimedia.org/T169644#3430941 (10ayounsi) For the naming, please use: pfw3a-eqiad and pfw3b-eqiad for the firewalls, and fasw-c1a-eqiad and fasw-c1b-eqiad for the switches Then for the cabling, please follow: {F87... 
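Context for the splay thread running through this hour: with `splay = true` in the `[agent]` section of puppet.conf, Puppet 3.x sleeps a random interval of up to `splaylimit` before each run, and `splaylimit` defaults to `runinterval` (itself 30 minutes by default), which is what produced the "Info: Sleeping for 1793 seconds (splay is enabled)" line quoted above and the pile-up of late runs against puppet-run's 1800s timeout. A hedged sketch of the knobs involved, assuming a stock Puppet 3.x layout:

```
# /etc/puppet/puppet.conf -- sketch only; the exact file contents are an
# assumption, not copied from production.
[agent]
splay = true          # sleep a random 0..splaylimit seconds before each run
# splaylimit is unset here, so it defaults to runinterval (30m by default),
# hence sleeps like the "1793 seconds" seen above.

# The change merged above (https://gerrit.wikimedia.org/r/364748) appears to
# have taken the other route: pass the command-line override in the puppet-run
# wrapper instead of touching the config, along the lines of:
#   puppet agent --onetime --no-splay ...
```

Command-line flags beat puppet.conf settings, which is why `--no-splay` works even while `splay = true` stays in the file; removing the explicit options, as discussed earlier, does not help, because the config value still applies.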
[15:29:08] could be [15:29:13] <_joe_> that's what I was seeing in the logs at least [15:29:28] <_joe_> anyways, I need that damn blacklist of down hosts [15:29:45] there is a -p95 in the command on wiki for that :-P [15:29:54] (03CR) 10Ottomata: "Mostly noop: https://puppet-compiler.wmflabs.org/7036/" [puppet] - 10https://gerrit.wikimedia.org/r/364741 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [15:29:56] (03CR) 10Ottomata: [C: 032] Add statistics private, cruncher and web profiles, with a little refactoring [puppet] - 10https://gerrit.wikimedia.org/r/364741 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [15:29:58] <_joe_> volans: it will still wait forever for some hosts to respond [15:30:05] (03PS8) 10Ottomata: Add statistics private, cruncher and web profiles, with a little refactoring [puppet] - 10https://gerrit.wikimedia.org/r/364741 (https://phabricator.wikimedia.org/T152712) [15:30:09] (03CR) 10Ottomata: [V: 032 C: 032] Add statistics private, cruncher and web profiles, with a little refactoring [puppet] - 10https://gerrit.wikimedia.org/r/364741 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [15:30:16] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [15:30:20] ssh timeout is 2s [15:30:26] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:31:06] RECOVERY - puppet last run on mw2201 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:31:27] RECOVERY - puppet last run on db2050 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:31:36] RECOVERY - puppet last run on mw2202 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [15:31:56] PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:31:57] PROBLEM - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:32:39] (03PS3) 10Ayounsi: Add runbook link and remove
<br> from Nagios interfaces check messages. [puppet] - 10https://gerrit.wikimedia.org/r/363309 [15:32:46] PROBLEM - puppet last run on cp1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:32:52] (03PS4) 10Ayounsi: Add runbook link and remove <br>
from Nagios interfaces check messages. [puppet] - 10https://gerrit.wikimedia.org/r/363309 [15:33:03] (03PS5) 10Giuseppe Lavagetto: role::scb: add recommendation-api service [puppet] - 10https://gerrit.wikimedia.org/r/364451 (https://phabricator.wikimedia.org/T165760) [15:33:06] PROBLEM - Check whether ferm is active by checking the default input chain on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [15:33:06] PROBLEM - cassandra-b CQL 10.64.48.118:9042 on restbase-dev1003 is CRITICAL: connect to address 10.64.48.118 and port 9042: Connection refused [15:33:12] <_joe_> mobrovac: ^^ [15:33:26] +1 [15:34:18] PROBLEM - DPKG on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [15:34:18] PROBLEM - cassandra-b SSL 10.64.48.118:7001 on restbase-dev1003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:34:19] <_joe_> ok, so I'll disable puppet on lvs and dns hosts at first [15:34:21] (03CR) 10Mobrovac: [C: 031] role::scb: add recommendation-api service [puppet] - 10https://gerrit.wikimedia.org/r/364451 (https://phabricator.wikimedia.org/T165760) (owner: 10Giuseppe Lavagetto) [15:34:29] kk [15:35:04] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 1.0 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/364752 (https://phabricator.wikimedia.org/T169312) (owner: 10Gilles) [15:35:08] PROBLEM - Disk space on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [15:35:08] PROBLEM - cassandra-b service on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [15:35:09] PROBLEM - puppet last run on wtp2010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:35:18] PROBLEM - puppet last run on wtp1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:35:58] PROBLEM - configured eth on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [15:36:28] PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:36:38] PROBLEM - MD RAID on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [15:36:39] PROBLEM - dhclient process on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [15:36:39] PROBLEM - Restbase root url on restbase2001 is CRITICAL: connect to address 10.192.16.152 and port 7231: Connection refused [15:36:58] PROBLEM - puppet last run on es2012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:37:39] PROBLEM - puppet last run on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [15:38:28] PROBLEM - Restbase root url on restbase-dev1003 is CRITICAL: connect to address 10.64.48.46 and port 7231: Connection refused [15:38:28] PROBLEM - restbase endpoints health on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [15:38:28] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct [15:38:45] can't reach --^, checking mgmt [15:38:53] (03PS6) 10Giuseppe Lavagetto: role::scb: add recommendation-api service [puppet] - 10https://gerrit.wikimedia.org/r/364451 (https://phabricator.wikimedia.org/T165760) [15:38:58] PROBLEM - puppet last run on db1060 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:38:58] <_joe_> mobrovac: merging! 
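The rollout pattern _joe_ describes above (disable Puppet on the sensitive LVS/DNS hosts, merge, then run the agent on a canary before the rest) is the sort of flow driven cluster-wide with Cumin plus the run-puppet-agent wrapper mentioned earlier in this log. A rough sketch only; the host selectors and the disable-puppet wrapper's exact name and arguments are assumptions, not commands copied from this session:

```
# Sketch, assuming ClusterShell-style host globs and WMF's puppet wrappers.
sudo cumin 'lvs*' 'disable-puppet "recommendation-api rollout - _joe_"'  # hold back LVS hosts
sudo cumin 'scb2001*' 'run-puppet-agent'                                 # canary run first
sudo cumin 'scb*' 'run-puppet-agent'                                     # then the rest of the cluster
```

The canary-first step is what surfaces a bad merge (like the admin.yaml typo below) on one host instead of the whole fleet.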
[15:39:08] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] role::scb: add recommendation-api service [puppet] - 10https://gerrit.wikimedia.org/r/364451 (https://phabricator.wikimedia.org/T165760) (owner: 10Giuseppe Lavagetto) [15:39:12] ok here we go [15:39:18] PROBLEM - puppet last run on mw2103 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:39:18] PROBLEM - SSH on restbase-dev1003 is CRITICAL: connect to address 10.64.48.46 and port 22: Connection refused [15:39:18] PROBLEM - salt-minion processes on restbase-dev1003 is CRITICAL: Return code of 255 is out of bounds [15:39:26] <_joe_> so I'm running puppet first on scb2001 [15:39:50] <_joe_> first, puppet merge which will add conftool data [15:40:10] <_joe_> hopefully :D [15:40:20] PROBLEM - cassandra-a CQL 10.64.48.117:9042 on restbase-dev1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:40:20] PROBLEM - Check size of conntrack table on restbase-dev1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:40:29] alright, let's see that [15:40:44] <_joe_> ok conftool worked :P [15:41:09] <_joe_> mobrovac: the software is already on tin, right? [15:41:16] yup _joe_ [15:41:36] <_joe_> shit, I fucked up something [15:41:44] euh? [15:41:51] <_joe_> admin.yaml [15:41:58] PROBLEM - salt-minion processes on stat1006 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:42:19] PROBLEM - Host restbase-dev1003 is DOWN: PING CRITICAL - Packet loss = 100% [15:42:19] PROBLEM - puppet last run on restbase-dev1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:42:25] oh yes _joe_ [15:42:28] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:42:35] <_joe_> fuck me [15:42:38] (03PS1) 10Giuseppe Lavagetto: admin: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/364762 [15:42:42] _joe_: https://gerrit.wikimedia.org/r/#/c/364451/6/modules/admin/data/data.yaml line 530 missing semicolon [15:42:48] PROBLEM - puppet last run on naos is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:42:48] PROBLEM - puppet last run on ms-be1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:42:48] PROBLEM - puppet last run on mc1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:42:48] PROBLEM - puppet last run on ores1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:42:49] PROBLEM - puppet last run on analytics1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:42:50] comma, sorry [15:42:58] PROBLEM - puppet last run on mw1277 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:42:58] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] admin: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/364762 (owner: 10Giuseppe Lavagetto) [15:42:58] PROBLEM - puppet last run on db2053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:42:58] PROBLEM - puppet last run on db1099 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:42:59] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:42:59] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:08] PROBLEM - puppet last run on ms-be2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:10] PROBLEM - puppet last run on mw2165 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:10] PROBLEM - puppet last run on mw1281 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:10] PROBLEM - puppet last run on hassaleh is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:10] PROBLEM - puppet last run on rhodium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:10] PROBLEM - puppet last run on cp2010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:10] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:18] PROBLEM - puppet last run on ms-be1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:18] PROBLEM - puppet last run on mc1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:18] PROBLEM - puppet last run on mc1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:18] PROBLEM - puppet last run on logstash1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:18] PROBLEM - puppet last run on scb1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:18] PROBLEM - puppet last run on db1071 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:18] PROBLEM - puppet last run on db1085 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:19] PROBLEM - puppet last run on mw2239 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:19] PROBLEM - puppet last run on ganeti1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:23] <_joe_> wow [15:43:26] <_joe_> just wow [15:43:28] PROBLEM - puppet last run on relforge1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:28] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:28] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:29] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:29] PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:43:30] <_joe_> I merged 1 minute ago [15:43:32] <_joe_> :P [15:43:35] admin.yaml should be reformatted, btw [15:43:35] (03PS1) 10Ottomata: Apply cruncher_new role to stat1006 [puppet] - 10https://gerrit.wikimedia.org/r/364763 (https://phabricator.wikimedia.org/T152712) [15:43:38] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:38] PROBLEM - puppet last run on ms-be2020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:38] PROBLEM - puppet last run on es2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:38] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:38] PROBLEM - puppet last run on elastic1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:38] PROBLEM - puppet last run on mw1185 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:39] PROBLEM - puppet last run on labvirt1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:43] (03CR) 10Ayounsi: [C: 032] Add runbook link and remove
<br> from Nagios interfaces check messages. [puppet] - 10https://gerrit.wikimedia.org/r/363309 (owner: 10Ayounsi) [15:43:48] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:48] PROBLEM - puppet last run on mw2236 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:48] PROBLEM - puppet last run on mw1207 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:48] PROBLEM - puppet last run on mw1213 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:49] PROBLEM - puppet last run on mw2119 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:49] PROBLEM - puppet last run on mw2204 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:49] PROBLEM - puppet last run on ms-be2036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:56] (03PS5) 10Ayounsi: Add runbook link and remove <br>
from Nagios interfaces check messages. [puppet] - 10https://gerrit.wikimedia.org/r/363309 [15:43:58] PROBLEM - puppet last run on mw1229 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:58] PROBLEM - puppet last run on ocg1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:58] PROBLEM - puppet last run on db1039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:58] PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:58] PROBLEM - puppet last run on mw1209 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:59] PROBLEM - puppet last run on mw2228 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:59] PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:59] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:43:59] PROBLEM - puppet last run on db2054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:00] PROBLEM - puppet last run on db1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:04] PROBLEM - puppet last run on ores2007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:04] PROBLEM - puppet last run on mwdebug1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:08] PROBLEM - puppet last run on rdb1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:08] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:08] PROBLEM - puppet last run on db1096 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:08] PROBLEM - puppet last run on scb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:08] PROBLEM - puppet last run on thumbor1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:08] PROBLEM - puppet last run on mw2218 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:09] PROBLEM - puppet last run on restbase2009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:14] 10Operations, 10ops-codfw, 10monitoring: mw2201, mw2202 - contact Dell and replace main board - https://phabricator.wikimedia.org/T170307#3431009 (10Papaul) Case for mw2202 Hi Papaul, Thank you for contacting the Dell Enterprise Technical Support team. We have opened up a new case # 950882983 for issues re... [15:44:17] (03PS2) 10Ottomata: Apply cruncher_new role to stat1006 [puppet] - 10https://gerrit.wikimedia.org/r/364763 (https://phabricator.wikimedia.org/T152712) [15:44:18] PROBLEM - puppet last run on db2028 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:44:18] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:18] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:18] PROBLEM - puppet last run on mw2150 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:18] PROBLEM - puppet last run on elastic2008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:18] PROBLEM - puppet last run on ms-be3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:28] PROBLEM - puppet last run on ms-be2029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:28] PROBLEM - puppet last run on wtp1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:28] PROBLEM - puppet last run on relforge1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:28] PROBLEM - puppet last run on db1084 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:28] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:29] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:38] PROBLEM - puppet last run on mw1267 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:38] PROBLEM - puppet last run on mc1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:38] PROBLEM - puppet last run on elastic1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:38] PROBLEM - puppet last run on wtp2016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:38] PROBLEM - puppet last run on mw1231 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:38] PROBLEM - puppet last run on mc2030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:38] RECOVERY - puppet last run on mw1289 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [15:44:39] PROBLEM - puppet last run on mc1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:39] PROBLEM - puppet last run on dbproxy1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:48] PROBLEM - puppet last run on mc2021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:48] PROBLEM - puppet last run on ms-be2023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:50] PROBLEM - puppet last run on analytics1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:50] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:44:50] PROBLEM - puppet last run on mw1279 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:50] PROBLEM - puppet last run on cp2023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:50] PROBLEM - puppet last run on db2086 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:58] <_joe_> can one minute cause that mess? [15:44:58] PROBLEM - puppet last run on dysprosium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:58] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:58] PROBLEM - puppet last run on dbstore1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:58] PROBLEM - puppet last run on mw1162 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:58] PROBLEM - puppet last run on analytics1067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:59] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:59] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:59] PROBLEM - puppet last run on aqs1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:44:59] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:00] PROBLEM - puppet last run on ms-be2022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:00] PROBLEM - puppet last run on mw1202 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:08] PROBLEM - puppet last run on mw1215 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:08] PROBLEM - puppet last run on mw1184 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:08] PROBLEM - puppet last run on labsdb1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:08] PROBLEM - puppet last run on thumbor2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:09] PROBLEM - puppet last run on elastic2016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:09] PROBLEM - puppet last run on ms-be1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:09] PROBLEM - puppet last run on mw1232 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:09] PROBLEM - puppet last run on mw2198 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:10] PROBLEM - puppet last run on wtp1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:10] PROBLEM - puppet last run on poolcounter2002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:45:11] PROBLEM - puppet last run on ganeti2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:27] PROBLEM - puppet last run on ores2009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:27] PROBLEM - puppet last run on restbase-dev1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:28] <_joe_> sorry everyone [15:45:37] PROBLEM - puppet last run on ms-fe2008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:38] PROBLEM - puppet last run on labcontrol1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:38] PROBLEM - puppet last run on mw2116 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:38] PROBLEM - puppet last run on pc2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:38] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:38] PROBLEM - puppet last run on wtp1044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:38] PROBLEM - puppet last run on mc1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:38] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:38] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [15:45:39] PROBLEM - puppet last run on cp1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:48] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:48] i see an ok in there finally =] [15:45:48] PROBLEM - puppet last run on es2011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:57] PROBLEM - puppet last run on ganeti2007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:57] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:58] PROBLEM - puppet last run on radium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:58] PROBLEM - puppet last run on mw2234 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:58] PROBLEM - puppet last run on mw1216 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:45:58] PROBLEM - puppet last run on mw2249 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:46:07] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:46:07] PROBLEM - puppet last run on acamar is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:46:17] PROBLEM - puppet last run on ocg1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:46:37] PROBLEM - puppet last run on wdqs2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:46:38] PROBLEM - puppet last run on mw2137 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:46:38] PROBLEM - puppet last run on mw2217 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:46:39] PROBLEM - puppet last run on wtp2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:47:18] PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:47:18] PROBLEM - puppet last run on cp2010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:47:24] PROBLEM - Host recommendation-api.svc.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [15:47:29] PROBLEM - Host recommendation-api.svc.eqiad.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [15:47:39] PROBLEM - puppet last run on thumbor1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:47:46] ok, the recommendation-api paged [15:47:56] yeah ignore [15:47:57] i assume its different than the puppet fails [15:47:59] ok [15:48:10] <_joe_> discard the pages, my fault [15:48:13] okey dokey [15:48:19] PROBLEM - puppet last run on elastic2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:49:00] (03CR) 10Ottomata: [V: 032 C: 032] Apply cruncher_new role to stat1006 [puppet] - 10https://gerrit.wikimedia.org/r/364763 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [15:49:09] RECOVERY - puppet last run on cp1049 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [15:49:20] PROBLEM - puppet last run on db1041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:49:20] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:49:20] PROBLEM - puppet last run on db1045 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:49:20] PROBLEM - puppet last run on db1071 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:49:20] PROBLEM - puppet last run on db2066 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:49:35] (03CR) 10Ayounsi: [V: 032 C: 032] Add runbook link and remove
<br> from Nagios interfaces check messages. [puppet] - 10https://gerrit.wikimedia.org/r/363309 (owner: 10Ayounsi) [15:49:39] PROBLEM - puppet last run on ms-fe2007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:49:39] PROBLEM - puppet last run on mw2183 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:49:39] PROBLEM - puppet last run on mw2221 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:49:49] (03PS6) 10Ayounsi: Add runbook link and remove <br>
from Nagios interfaces check messages. [puppet] - 10https://gerrit.wikimedia.org/r/363309 [15:49:49] PROBLEM - puppet last run on mw2201 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:49:52] (03CR) 10Ayounsi: [V: 032 C: 032] Add runbook link and remove <br>
from Nagios interfaces check messages. [puppet] - 10https://gerrit.wikimedia.org/r/363309 (owner: 10Ayounsi) [15:50:01] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:50:33] PROBLEM - puppet last run on db1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:50:42] (03PS1) 10Jcrespo: dbstore3: tune down some of the parameters due to multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364766 (https://phabricator.wikimedia.org/T169514) [15:50:43] PROBLEM - puppet last run on analytics1061 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:50:52] RECOVERY - puppet last run on elastic1046 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [15:51:03] PROBLEM - puppet last run on cp2023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:51:12] !log mobrovac@tin Started deploy [recommendation-api/deploy@ed41fc4]: Initial deploy on canary scb2001 - T165760 [15:51:13] PROBLEM - puppet last run on db2054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:51:23] PROBLEM - puppet last run on db2028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:24] T165760: Deploy Recommendation API as a service - https://phabricator.wikimedia.org/T165760 [15:51:27] !log mobrovac@tin Finished deploy [recommendation-api/deploy@ed41fc4]: Initial deploy on canary scb2001 - T165760 (duration: 00m 15s) [15:51:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:44] PROBLEM - puppet last run on mw2128 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:51:44] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [15:51:57] (03PS2) 10Milimetric: Clone wikistats v2 repository and link it to v2 [puppet] - 10https://gerrit.wikimedia.org/r/362118 (https://phabricator.wikimedia.org/T167684) [15:52:02] _joe_: puppet still failing? need a hand? [15:52:03] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:52:03] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:52:03] PROBLEM - puppet last run on darmstadtium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:52:03] PROBLEM - puppet last run on db1035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:52:03] PROBLEM - puppet last run on db1039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:52:03] PROBLEM - puppet last run on db1044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:52:12] PROBLEM - puppet last run on labvirt1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:52:13] PROBLEM - puppet last run on mw1294 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:52:17] !log mobrovac@tin Started deploy [recommendation-api/deploy@ed41fc4]: Initial deploy on canary scb2001, take #2 - T165760 [15:52:18] (03PS2) 10Jcrespo: dbstore3: tune down some of the parameters due to multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364766 (https://phabricator.wikimedia.org/T169514) [15:52:23] !log mobrovac@tin Finished deploy [recommendation-api/deploy@ed41fc4]: Initial deploy on canary scb2001, take #2 - T165760 (duration: 00m 06s) [15:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:32] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:37] RECOVERY - Host recommendation-api.svc.eqiad.wmnet is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [15:52:39] <_joe_> volans: it should be unrelated to my change [15:52:42] PROBLEM - puppet last run on mw1295 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:52:53] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:52:54] PROBLEM - puppet last run on db1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:53:02] PROBLEM - puppet last run on db1069 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:53:02] PROBLEM - puppet last run on db1060 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:53:02] PROBLEM - puppet last run on db2053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:53:03] PROBLEM - puppet last run on db2064 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:53:03] PROBLEM - puppet last run on dbstore2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:53:12] PROBLEM - puppet last run on db2087 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:53:12] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:53:37] <_joe_> volans: someone must have launched some script that is running puppet everywhere [15:53:42] RECOVERY - puppet last run on wtp2010 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [15:53:49] <_joe_> and it happened to be running during the problem I guess [15:53:52] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:53:52] PROBLEM - puppet last run on db1085 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:53:53] PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:53:53] PROBLEM - puppet last run on mc1008 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:53:56] <_joe_> but lemme check [15:54:02] RECOVERY - puppet last run on db1048 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [15:54:02] RECOVERY - puppet last run on db1069 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:54:05] _joe_: checked [15:54:06] (03CR) 10Jcrespo: [V: 032 C: 032] dbstore3: tune down some of the parameters due to multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/364766 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [15:54:09] already fixed by alex [15:54:13] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [15:54:21] <_joe_> volans: fixed what? [15:54:33] PROBLEM - puppet last run on ms-be1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:54:41] <_joe_> I did fix the error in admin/data.yaml [15:54:42] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:54:48] PROBLEM - mysqld processes on db1102 is CRITICAL: PROCS CRITICAL: 3 processes with command name mysqld [15:54:48] PROBLEM - puppet last run on db1096 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:54:52] PROBLEM - puppet last run on mendelevium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:55:02] RECOVERY - puppet last run on db1045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:55:04] ok [15:55:13] RECOVERY - puppet last run on mc1020 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:55:23] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: cp3009_v4, cp3009_v6 [15:55:33] RECOVERY - puppet last run on radium is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:55:42] PROBLEM - puppet last run on mw2141 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:55:42] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:55:52] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:55:52] PROBLEM - puppet last run on db1084 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:55:52] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:55:52] PROBLEM - puppet last run on db2057 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:55:52] PROBLEM - puppet last run on db2068 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:55:52] PROBLEM - puppet last run on mc2035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:55:53] RECOVERY - puppet last run on es2012 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [15:55:53] PROBLEM - puppet last run on es2003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:56:04] RECOVERY - puppet last run on ganeti1004 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [15:56:12] RECOVERY - Check systemd state on ms-be1036 is OK: OK - running: The system is fully operational [15:56:13] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:56:14] * volans blames the previous splay issue _joe_ [15:56:32] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:56:35] 10Operations, 10DBA, 10Scoring-platform-team, 10cloud-services-team: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584#3431085 (10madhuvishy) [15:56:38] !log mobrovac@tin Started deploy [recommendation-api/deploy@ed41fc4]: Initial deploy on canary scb2001, take #3 - T165760 [15:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:49] T165760: Deploy Recommendation API as a service - https://phabricator.wikimedia.org/T165760 [15:57:20] PROBLEM - puppet last run on cp2011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:57:24] !log mobrovac@tin Finished deploy [recommendation-api/deploy@ed41fc4]: Initial deploy on canary scb2001, take #3 - T165760 (duration: 00m 46s) [15:57:29] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:57:30] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:57:30] PROBLEM - puppet last run on lvs2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:49] RECOVERY - puppet last run on db1084 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [15:57:59] PROBLEM - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: cp3009_v4, cp3009_v6 [15:58:00] RECOVERY - puppet last run on ms-fe2008 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [15:58:19] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /translation/articles/{source}/{target}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 404 (expecting: 200): /translation/articles/{source}/{target}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200): / [15:58:19] s/{source}/{target}{/seed} (bad source) is CRITICAL: Test bad source returned the unexpected status 404 (expecting: 504) [15:58:29] RECOVERY - puppet last run on scb2001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:58:39] PROBLEM - puppet last run on kafka1018 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [15:58:59] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:59:19] RECOVERY - puppet last run on wtp2002 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [15:59:20] RECOVERY - puppet last run on ganeti2005 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [15:59:20] PROBLEM - puppet last run on mc2034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:59:29] PROBLEM - puppet last run on mw2232 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:59:30] RECOVERY - salt-minion processes on ms-be1036 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:59:39] RECOVERY - puppet last run on db2072 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:59:40] RECOVERY - puppet last run on elastic2004 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:59:49] PROBLEM - MD RAID on stat1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:59:59] RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [15:59:59] PROBLEM - DPKG on stat1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:59:59] PROBLEM - puppet last run on db2049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:00:19] RECOVERY - puppet last run on labvirt1014 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [16:00:30] RECOVERY - puppet last run on ms-be1014 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [16:00:39] RECOVERY - puppet last run on naos is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [16:00:40] PROBLEM - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: cp3009_v4, cp3009_v6 [16:00:49] RECOVERY - MD RAID on stat1006 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [16:00:49] PROBLEM - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: cp3009_v4, cp3009_v6 [16:00:59] RECOVERY - puppet last run on ms-be2017 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [16:01:19] RECOVERY - puppet last run on db1060 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [16:01:30] RECOVERY - puppet last run on mw1185 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [16:01:39] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [16:01:39] PROBLEM - Check size of conntrack table on stat1006 is CRITICAL: NRPE: Command check_conntrack_table_size not defined [16:02:09] PROBLEM - puppet last run on db1099 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:02:19] RECOVERY - puppet last run on wtp1015 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [16:02:30] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [16:02:40] RECOVERY - Check size of conntrack table on stat1006 is OK: OK: nf_conntrack is 0 % full [16:02:40] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [16:02:49] RECOVERY - puppet last run on ores2009 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [16:02:59] RECOVERY - DPKG on stat1006 is OK: All packages OK [16:03:10] PROBLEM - Confd template for /var/lib/gdnsd/discovery-recommendation-api.state on cp1008 is CRITICAL: File not found: /var/lib/gdnsd/discovery-recommendation-api.state [16:03:10] RECOVERY - puppet last run on wdqs2003 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [16:03:10] RECOVERY - puppet last run on ores2007 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [16:03:15] (03PS3) 10Milimetric: Clone wikistats v2 repository and link it to v2 [puppet] - 10https://gerrit.wikimedia.org/r/362118 (https://phabricator.wikimedia.org/T167684) [16:03:49] PROBLEM - puppet last run on scb2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[recommendation-api/deploy] [16:04:27] RECOVERY - puppet last run on mw1209 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [16:04:47] RECOVERY - puppet last run on mw1215 is OK: OK: Puppet is currently enabled, last run 8 minutes ago with 0 failures [16:04:58] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [16:05:28] PROBLEM - MegaRAID on db1066 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [16:05:29] ACKNOWLEDGEMENT - MegaRAID on db1066 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T170433 [16:05:34] 10Operations, 10ops-eqiad: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T170433#3431124 (10ops-monitoring-bot) [16:05:37] RECOVERY - puppet last run on mw1279 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:05:38] PROBLEM - puppet last run on mw2227 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:05:47] RECOVERY - puppet last run on ms-be2036 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [16:05:57] RECOVERY - puppet last run on kafka1018 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [16:05:58] RECOVERY - puppet last run on mw1277 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [16:05:59] !log mobrovac@tin Started deploy [recommendation-api/deploy@eb2fef3]: (no justification provided) [16:06:07] RECOVERY - puppet last run on mw2204 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [16:06:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:17] RECOVERY - puppet last run on mw2128 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [16:06:17] RECOVERY - puppet last run on mw2239 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [16:06:17] RECOVERY - puppet last run on mw2137 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [16:06:20] !log lvs1006, lvs1010: upgrade pybal to 1.13.7 T82747 T154759 [16:06:28] PROBLEM - puppet last run on elastic1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:31] T82747: pybal health checks are ipv4 even for ipv6 vips - https://phabricator.wikimedia.org/T82747 [16:06:31] T154759: Pybal not happy with DNS delays - https://phabricator.wikimedia.org/T154759 [16:06:33] !log mobrovac@tin Finished deploy [recommendation-api/deploy@eb2fef3]: (no justification provided) (duration: 00m 33s) [16:06:40] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T170433#3431134 (10Volans) p:05Triage>03Normal [16:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:57] volans: there was already a ticket for that host I believe [16:07:07] marostegui: ah sorry [16:07:09] RECOVERY - puppet last run on db2028 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [16:07:10] (03CR) 10jerkins-bot: [V: 04-1] Clone wikistats v2 repository and link it to v2 [puppet] - 10https://gerrit.wikimedia.org/r/362118 (https://phabricator.wikimedia.org/T167684) (owner: 10Milimetric) [16:07:18] volans: let me double check [16:07:20] feel free to close as duplicate marostegui [16:07:29] PROBLEM - puppet last run on mw2180 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:07:30] yeah, there is [16:07:31] https://phabricator.wikimedia.org/T169448 [16:07:36] I will, thanks volans :) [16:07:48] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [16:07:51] RECOVERY - puppet last run on mw2249 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:08:08] RECOVERY - puppet last run on scb2002 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [16:08:28] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T170433#3431142 (10Marostegui) 05Open>03declined There was already a ticket for that host: T169448 [16:09:00] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T169448#3431147 (10jcrespo) [16:09:03] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T170433#3431149 (10jcrespo) [16:09:18] RECOVERY - puppet last run on db2057 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [16:09:24] !log upload thumbor 1.0-1 to install1002 [16:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:38] RECOVERY - puppet last run on mw1295 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [16:09:39] PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - dns_rec6_53 - Could not depool server hydrogen.wikimedia.org because of too many down!: dns_rec6_53_udp - Could not depool server hydrogen.wikimedia.org because of too many down! [16:09:39] RECOVERY - puppet last run on hassaleh is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:10:08] PROBLEM - puppet last run on scb2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 seconds ago with 1 failures. 
Failed resources (up to 3 shown): Package[recommendation-api/deploy] [16:10:29] RECOVERY - puppet last run on mw2103 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:10:39] RECOVERY - puppet last run on elastic1029 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [16:11:00] RECOVERY - puppet last run on ms-be2022 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [16:11:00] RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy [16:11:01] RECOVERY - puppet last run on ms-be2020 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [16:11:11] RECOVERY - puppet last run on mw1229 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [16:11:11] RECOVERY - puppet last run on ores1005 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [16:11:11] RECOVERY - puppet last run on db1054 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [16:11:20] RECOVERY - puppet last run on mw2165 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [16:11:20] RECOVERY - puppet last run on mc1004 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [16:11:20] RECOVERY - puppet last run on mendelevium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:11:21] RECOVERY - puppet last run on mc1031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:11:21] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /translation/articles/{source}/{target}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 404 (expecting: 200): /translation/articles/{source}/{target}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200): / [16:11:21] s/{source}/{target}{/seed} (bad source) is CRITICAL: Test bad source returned the unexpected status 404 (expecting: 504) [16:11:30] RECOVERY - puppet last run on mc1035 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [16:11:31] RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [16:11:31] RECOVERY - puppet last run on mw2183 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [16:11:41] RECOVERY - puppet last run on cp2010 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [16:11:50] RECOVERY - puppet last run on db2053 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [16:11:50] RECOVERY - puppet last run on es2004 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:11:51] RECOVERY - puppet last run on ocg1003 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [16:12:00] RECOVERY - puppet last run on labcontrol1003 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [16:12:00] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [16:12:00] RECOVERY - puppet last run on mw1281 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [16:12:00] RECOVERY - puppet last run on rdb1008 is OK: OK: Puppet is currently enabled, last run 
23 seconds ago with 0 failures [16:12:09] ottomata: Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class geowiki::params for helium.eqiad.wmnet on node helium.eqiad.wmnet [16:12:09] Warning: Not using cache on failed catalog [16:12:09] Error: Could not retrieve catalog; skipping run [16:12:10] RECOVERY - puppet last run on mw2221 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [16:12:11] RECOVERY - puppet last run on db1072 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [16:12:11] RECOVERY - puppet last run on thumbor1004 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [16:12:11] RECOVERY - puppet last run on ms-be1020 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:12:20] RECOVERY - puppet last run on db1099 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [16:12:20] RECOVERY - puppet last run on mw2141 is OK: OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures [16:12:20] RECOVERY - puppet last run on restbase-dev1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:12:20] RECOVERY - puppet last run on ms-be1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:12:20] RECOVERY - puppet last run on logstash1004 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [16:12:20] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [16:12:21] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [16:12:21] RECOVERY - puppet last run on db1096 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [16:12:21] RECOVERY - puppet last run on db2086 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [16:12:30] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [16:12:31] RECOVERY - puppet last run on mw1267 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [16:12:31] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [16:12:32] ottomata: something about geowiki::params broke the backup server [16:12:40] RECOVERY - puppet last run on mw2180 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [16:12:41] RECOVERY - puppet last run on mw1231 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [16:12:41] RECOVERY - puppet last run on elastic1027 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [16:12:41] RECOVERY - puppet last run on analytics1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:12:41] RECOVERY - puppet last run on mc2030 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [16:12:41] RECOVERY - puppet last run on wtp2016 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [16:12:41] RECOVERY - puppet last run on mwdebug1002 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [16:12:48] ! 
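The geowiki::params error above is a catalog compile failure: the class was removed from the puppet tree while helium's role still referenced it, so the master returns a 400 and the agent skips the run. A minimal sketch of reproducing the diagnosis from the affected host, assuming standard Puppet agent flags:
```
# One-off agent run on helium; --test forces an immediate run, --noop makes no
# changes. The error text below is the same output pasted in the channel above.
sudo puppet agent --test --noop
# Error: Could not retrieve catalog from remote server: Error 400 on SERVER:
#   Could not find class geowiki::params for helium.eqiad.wmnet
# Error: Could not retrieve catalog; skipping run
```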
[16:12:50] RECOVERY - puppet last run on restbase-dev1002 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:12:55] akosiaris: hmm [16:13:00] RECOVERY - puppet last run on db1039 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [16:13:00] RECOVERY - puppet last run on poolcounter2002 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [16:13:00] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [16:13:00] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [16:13:00] RECOVERY - puppet last run on ms-be2023 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [16:13:00] RECOVERY - puppet last run on analytics1067 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [16:13:00] RECOVERY - puppet last run on aqs1005 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [16:13:10] RECOVERY - puppet last run on mw1184 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [16:13:10] RECOVERY - puppet last run on mw1202 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [16:13:10] RECOVERY - puppet last run on es2003 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [16:13:10] RECOVERY - puppet last run on db1041 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [16:13:10] RECOVERY - puppet last run on db2087 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [16:13:10] RECOVERY - puppet last run on mw2228 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:13:11] RECOVERY - puppet last run on mw2218 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [16:13:11] RECOVERY - puppet last run on mw2119 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [16:13:11] RECOVERY - puppet last run on db2054 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [16:13:12] ah! [16:13:16] got it akosiaris [16:13:18] sorry missed that one [16:13:20] RECOVERY - puppet last run on restbase2009 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [16:13:20] RECOVERY - puppet last run on wtp1003 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [16:13:20] RECOVERY - puppet last run on thumbor2004 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [16:13:21] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [16:13:27] ottomata: no worries [16:13:29] thanks! 
[16:13:30] RECOVERY - puppet last run on mc2035 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [16:13:30] RECOVERY - puppet last run on thumbor1003 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [16:13:30] RECOVERY - puppet last run on ms-fe2007 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [16:13:30] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [16:13:31] RECOVERY - puppet last run on elastic1018 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [16:13:40] RECOVERY - puppet last run on mc1023 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [16:13:40] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [16:13:40] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [16:13:40] RECOVERY - puppet last run on cp1046 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [16:13:40] RECOVERY - puppet last run on labvirt1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:13:41] RECOVERY - puppet last run on mw2150 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [16:13:41] RECOVERY - puppet last run on mw2236 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:13:50] RECOVERY - puppet last run on analytics1058 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:13:50] RECOVERY - puppet last run on analytics1052 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [16:13:50] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [16:13:50] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /translation/articles/{source}/{target}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 404 (expecting: 200): /translation/articles/{source}/{target}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200): / [16:13:52] s/{source}/{target}{/seed} (bad source) is CRITICAL: Test bad source returned the unexpected status 404 (expecting: 504) [16:13:52] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [16:13:52] PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - dns_rec6_53 - Could not depool server chromium.wikimedia.org because of too many down!: dns_rec6_53_udp - Could not depool server chromium.wikimedia.org because of too many down! [16:14:00] RECOVERY - puppet last run on db1085 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:14:00] PROBLEM - puppet last run on scb2006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 15 seconds ago with 1 failures. 
Failed resources (up to 3 shown) [16:14:00] RECOVERY - puppet last run on mc2034 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [16:14:00] RECOVERY - puppet last run on mc2021 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [16:14:00] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:14:00] RECOVERY - puppet last run on dysprosium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:14:00] RECOVERY - puppet last run on mw1275 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [16:14:10] RECOVERY - puppet last run on ganeti2007 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [16:14:10] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [16:14:10] RECOVERY - puppet last run on scb2004 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [16:14:11] RECOVERY - puppet last run on mw2198 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:14:11] RECOVERY - puppet last run on acamar is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [16:14:30] RECOVERY - puppet last run on mw1232 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [16:14:30] RECOVERY - puppet last run on db2068 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [16:14:30] RECOVERY - puppet last run on es2011 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [16:14:30] RECOVERY - puppet last run on db1035 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:14:30] RECOVERY - puppet last run on db2075 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:14:30] RECOVERY - puppet last run on elastic2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:14:31] RECOVERY - puppet last run on wtp1044 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [16:14:40] RECOVERY - puppet last run on rhodium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:14:41] RECOVERY - puppet last run on mw2232 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [16:14:45] !log downgrade pybal to 1.13.6 on lvs1010 T82747 T154759 (1.13.7 throwing exceptions) [16:14:50] RECOVERY - puppet last run on cp2023 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:14:50] RECOVERY - puppet last run on db2064 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [16:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:57] T82747: pybal health checks are ipv4 even for ipv6 vips - https://phabricator.wikimedia.org/T82747 [16:14:57] T154759: Pybal not happy with DNS delays - https://phabricator.wikimedia.org/T154759 [16:15:00] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [16:15:03] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:15:03] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:15:03] RECOVERY - puppet last run on 
scb2006 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [16:15:05] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: service=recommendation-api,dc=codfw [16:15:11] RECOVERY - puppet last run on mw2234 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [16:15:11] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [16:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:30] RECOVERY - puppet last run on ms-be1029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:15:30] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [16:15:41] RECOVERY - puppet last run on mw2217 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [16:15:50] RECOVERY - puppet last run on mw1162 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:16:00] RECOVERY - puppet last run on labsdb1010 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [16:16:00] RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy [16:16:11] RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [16:16:20] RECOVERY - puppet last run on db1044 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:16:20] RECOVERY - puppet last run on labvirt1015 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:16:30] RECOVERY - puppet last run on relforge1002 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [16:16:40] RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [16:16:40] RECOVERY - puppet last run on mc1008 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [16:16:46] (03PS1) 10Ottomata: Remove references to geowiki::params [puppet] - 10https://gerrit.wikimedia.org/r/364770 (https://phabricator.wikimedia.org/T152712) [16:17:00] RECOVERY - puppet last run on dbstore2001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [16:17:00] RECOVERY - puppet last run on mw1213 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [16:17:10] RECOVERY - puppet last run on darmstadtium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:17:10] RECOVERY - puppet last run on db1071 is OK: OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures [16:17:22] (03CR) 10Ottomata: [V: 032 C: 032] Remove references to geowiki::params [puppet] - 10https://gerrit.wikimedia.org/r/364770 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [16:18:10] RECOVERY - puppet last run on ms-be3003 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [16:19:00] RECOVERY - puppet last run on db2066 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [16:19:31] (03PS1) 10Ottomata: geowiki already uses /srv on stat1003, so we can change the backup director path now [puppet] - 10https://gerrit.wikimedia.org/r/364771 (https://phabricator.wikimedia.org/T152712) [16:19:42] (03CR) 10Ottomata: [V: 032 C: 032] geowiki already uses /srv on stat1003, so we can change the backup director 
path now [puppet] - 10https://gerrit.wikimedia.org/r/364771 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [16:19:50] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [16:19:50] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 7 minutes ago with 0 failures [16:20:50] RECOVERY - puppet last run on mw1207 is OK: OK: Puppet is currently enabled, last run 8 minutes ago with 0 failures [16:20:50] RECOVERY - puppet last run on ms-be2029 is OK: OK: Puppet is currently enabled, last run 9 minutes ago with 0 failures [16:20:50] RECOVERY - puppet last run on wtp1046 is OK: OK: Puppet is currently enabled, last run 8 minutes ago with 0 failures [16:21:20] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [16:23:41] RECOVERY - puppet last run on mc1010 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [16:23:50] RECOVERY - puppet last run on mw2116 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [16:23:50] RECOVERY - puppet last run on pc2004 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [16:25:40] RECOVERY - puppet last run on cp2011 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [16:27:20] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [16:27:42] RECOVERY - puppet last run on db2049 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [16:28:30] PROBLEM - puppet last run on scb1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 32 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[recommendation-api/deploy] [16:28:50] PROBLEM - puppet last run on scb1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 43 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[recommendation-api/deploy] [16:29:11] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 seconds ago with 1 failures. 
Failed resources (up to 3 shown): Package[recommendation-api/deploy] [16:29:40] RECOVERY - puppet last run on relforge1001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [16:30:00] RECOVERY - puppet last run on dbproxy1004 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [16:30:10] RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [16:30:50] RECOVERY - puppet last run on scb1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:31:10] RECOVERY - puppet last run on mw2201 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [16:33:30] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /translation/articles/{source}/{target}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 404 (expecting: 200): /translation/articles/{source}/{target}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200): / [16:33:30] s/{source}/{target}{/seed} (bad source) is CRITICAL: Test bad source returned the unexpected status 404 (expecting: 504) [16:33:40] RECOVERY - puppet last run on scb1004 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [16:34:20] RECOVERY - puppet last run on mw2227 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:35:20] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /translation/articles/{source}/{target}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 404 (expecting: 200): /translation/articles/{source}/{target}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200): / [16:35:20] s/{source}/{target}{/seed} (bad source) is CRITICAL: Test bad source returned the unexpected status 404 (expecting: 504) [16:36:10] RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [16:36:29] !log oblivian@puppetmaster1001 conftool action : set/pooled=yes; selector: service=recommendation-api,dc=eqiad [16:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:20] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - wdqs_80 - Could not depool server wdqs2001.codfw.wmnet because of too many down! [16:38:30] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - wdqs_80 - Could not depool server wdqs2003.codfw.wmnet because of too many down! [16:38:40] RECOVERY - puppet last run on lvs2003 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [16:38:53] (03PS1) 10Ottomata: Package fixes for stat boxes to stretch [puppet] - 10https://gerrit.wikimedia.org/r/364773 (https://phabricator.wikimedia.org/T152712) [16:39:01] <_joe_> can someone check wdqs in codfw? 
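For the wdqs check _joe_ asks about here, the usual first step is to inspect service state and logs on the alerting backends; a hypothetical triage sketch, where the wdqs-blazegraph unit name is an assumption and not taken from this log:
```
# Hypothetical triage on one of the alerting wdqs backends (unit name assumed):
ssh wdqs2003.codfw.wmnet
systemctl status wdqs-blazegraph                      # is the service running?
journalctl -u wdqs-blazegraph --since "1 hour ago"    # why did it stop/restart?
```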
[16:39:11] <_joe_> I am doing something else [16:40:36] RECOVERY - Host recommendation-api.svc.codfw.wmnet is UP: PING OK - Packet loss = 0%, RTA = 36.00 ms [16:41:41] seems wdqs service was stopped and changed to running with a puppet run on wdqs2003 [16:42:02] <_joe_> robh: look at the logs for the service (no idea where atm sorry) [16:42:15] yeah im still grepping about [16:42:24] i had two down, just fired puppet on the one to see what it would do [16:42:31] (03CR) 10Ottomata: [V: 032 C: 032] Package fixes for stat boxes to stretch [puppet] - 10https://gerrit.wikimedia.org/r/364773 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [16:44:08] RECOVERY - puppet last run on elastic2008 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [16:44:42] (03PS1) 10Ottomata: Fix typo in statistics::packages [puppet] - 10https://gerrit.wikimedia.org/r/364774 [16:44:53] (03CR) 10Ottomata: [V: 032 C: 032] Fix typo in statistics::packages [puppet] - 10https://gerrit.wikimedia.org/r/364774 (owner: 10Ottomata) [16:46:01] (03PS1) 10Giuseppe Lavagetto: conftool-data: correctly rename discovery object [puppet] - 10https://gerrit.wikimedia.org/r/364775 [16:46:18] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:46:24] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] conftool-data: correctly rename discovery object [puppet] - 10https://gerrit.wikimedia.org/r/364775 (owner: 10Giuseppe Lavagetto) [16:47:30] (03PS1) 10Ottomata: Invert os_version check conditional in statistics::packages [puppet] - 10https://gerrit.wikimedia.org/r/364776 [16:47:48] RECOVERY - Confd template for /var/lib/gdnsd/discovery-recommendation-api.state on cp1008 is OK: No errors detected [16:47:59] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=recommendation-api [16:48:04] (03CR) 10Ottomata: [V: 032 C: 032] Invert os_version check conditional in statistics::packages [puppet] - 10https://gerrit.wikimedia.org/r/364776 (owner: 10Ottomata) [16:48:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:11] (03PS2) 10Ottomata: Invert os_version check conditional in statistics::packages [puppet] - 10https://gerrit.wikimedia.org/r/364776 [16:48:14] (03CR) 10Ottomata: [V: 032 C: 032] Invert os_version check conditional in statistics::packages [puppet] - 10https://gerrit.wikimedia.org/r/364776 (owner: 10Ottomata) [16:48:58] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /translation/articles/{source}/{target}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 404 (expecting: 200): /translation/articles/{source}/{target}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200): / [16:48:58] s/{source}/{target}{/seed} (bad source) is CRITICAL: Test bad source returned the unexpected status 404 (expecting: 504) [16:50:38] (03PS3) 10Giuseppe Lavagetto: Add discovery DNS entry for service recommendation-api [dns] - 10https://gerrit.wikimedia.org/r/364458 (https://phabricator.wikimedia.org/T165760) [16:50:38] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /translation/articles/{source}/{target}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 404 (expecting: 200): /translation/articles/{source}/{target}{/seed}
(normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200): / [16:50:38] s/{source}/{target}{/seed} (bad source) is CRITICAL: Test bad source returned the unexpected status 404 (expecting: 504) [16:50:38] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /translation/articles/{source}/{target}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 404 (expecting: 200): /translation/articles/{source}/{target}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200): / [16:50:39] s/{source}/{target}{/seed} (bad source) is CRITICAL: Test bad source returned the unexpected status 404 (expecting: 504) [16:51:06] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: pybal health checks are ipv4 even for ipv6 vips - https://phabricator.wikimedia.org/T82747#904929 (10ema) Almost there! I had to downgrade pybal to 1.13.6 on lvs1010 because of the following exception: ``` File "/usr/lib/python2.7/dist-packages/twist... [16:51:14] (03PS1) 10Ottomata: Require libgdal-dev in stretch in statistics::packages [puppet] - 10https://gerrit.wikimedia.org/r/364778 [16:51:30] (03CR) 10Ottomata: [V: 032 C: 032] Require libgdal-dev in stretch in statistics::packages [puppet] - 10https://gerrit.wikimedia.org/r/364778 (owner: 10Ottomata) [16:51:38] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /translation/articles/{source}/{target}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 404 (expecting: 200): /translation/articles/{source}/{target}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200): / [16:51:39] s/{source}/{target}{/seed} (bad source) is CRITICAL: Test bad source returned the unexpected status 404 (expecting: 504) [16:52:18] !log roll-restart and upgrade thumbor in eqiad [16:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:19] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /translation/articles/{source}/{target}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 404 (expecting: 200): /translation/articles/{source}/{target}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200): / [16:53:19] s/{source}/{target}{/seed} (bad source) is CRITICAL: Test bad source returned the unexpected status 404 (expecting: 504) [16:53:28] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [16:53:47] ^ is me [16:55:33] 10Operations, 10ops-eqiad, 10DC-Ops, 10Services, 10User-fgiunchedi: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3431368 (10GWicke) 05Open>03Resolved @eevans, is there anything left to do here? 
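The pybal downgrade logged at 16:14:45 above maps to a plain apt version pin plus a service restart; a sketch of the assumed rollback procedure (not verbatim from the log):
```
# Assumed rollback for pybal on lvs1010; apt's pkg=version syntax selects the
# older build still present in the repository.
sudo apt-get install pybal=1.13.6
sudo systemctl restart pybal
```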
[16:55:44] 10Operations, 10ops-eqiad, 10DC-Ops, 10Services, 10User-fgiunchedi: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3431371 (10GWicke) 05Resolved>03Open [16:55:56] 10Operations, 10ops-eqiad, 10DC-Ops, 10Services (watching), 10User-fgiunchedi: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3287641 (10GWicke) [16:56:08] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [16:59:01] (03PS1) 10Ottomata: MySQL client for stastistics::packages in stretch, /srv/mediawiki dir [puppet] - 10https://gerrit.wikimedia.org/r/364782 (https://phabricator.wikimedia.org/T152712) [16:59:20] (03CR) 10Ottomata: [V: 032 C: 032] MySQL client for stastistics::packages in stretch, /srv/mediawiki dir [puppet] - 10https://gerrit.wikimedia.org/r/364782 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [17:00:04] RainbowSprinkles and Jdlrobson: Dear anthropoid, the time has come. Please deploy Special (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170712T1700). [17:00:06] 10Operations, 10ops-eqiad: Decommission mw1196 - https://phabricator.wikimedia.org/T170441#3431403 (10Cmjohnson) [17:01:45] (03PS2) 10Faidon Liambotis: icinga: move RIPE Atlas host monitoring under netops [puppet] - 10https://gerrit.wikimedia.org/r/364207 (https://phabricator.wikimedia.org/T167279) [17:01:58] (03CR) 10Faidon Liambotis: [C: 032] icinga: move RIPE Atlas host monitoring under netops [puppet] - 10https://gerrit.wikimedia.org/r/364207 (https://phabricator.wikimedia.org/T167279) (owner: 10Faidon Liambotis) [17:03:07] 10Operations, 10Performance-Team, 10Thumbor, 10Patch-For-Review: Implement PoolCounter support in Thumbor - https://phabricator.wikimedia.org/T151066#3431426 (10Gilles) [17:03:10] 10Operations, 10Performance-Team, 10Thumbor, 10Patch-For-Review: Implement poolcounter failover in Thumbor - https://phabricator.wikimedia.org/T169312#3431425 (10Gilles) 05Open>03Resolved [17:05:10] PROBLEM - LVS HTTP IPv4 on wdqs.svc.codfw.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.073 second response time [17:05:29] 10Operations, 10ops-eqiad, 10DC-Ops, 10Services (watching), 10User-fgiunchedi: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3431435 (10Cmjohnson) [17:06:39] (03PS1) 10Ottomata: Own mediawiki/core checkout as stats [puppet] - 10https://gerrit.wikimedia.org/r/364784 (https://phabricator.wikimedia.org/T152712) [17:06:41] (03CR) 10Faidon Liambotis: [V: 032 C: 032] icinga: move RIPE Atlas host monitoring under netops [puppet] - 10https://gerrit.wikimedia.org/r/364207 (https://phabricator.wikimedia.org/T167279) (owner: 10Faidon Liambotis) [17:06:53] (03CR) 10Ottomata: [V: 032 C: 032] Own mediawiki/core checkout as stats [puppet] - 10https://gerrit.wikimedia.org/r/364784 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [17:07:37] !log upgrading nginx on mwdebug* [17:07:43] (03PS3) 10Faidon Liambotis: icinga: move RIPE Atlas measurements under netops [puppet] - 10https://gerrit.wikimedia.org/r/364208 [17:07:47] (03PS2) 10Ottomata: Own mediawiki/core checkout as stats [puppet] - 10https://gerrit.wikimedia.org/r/364784 (https://phabricator.wikimedia.org/T152712) [17:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:54] (03CR) 10Ottomata: [V: 032 C: 032] Own mediawiki/core checkout as stats [puppet] - 
10https://gerrit.wikimedia.org/r/364784 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [17:09:07] \o [17:09:10] RECOVERY - LVS HTTP IPv4 on wdqs.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 13048 bytes in 0.073 second response time [17:09:52] RainbowSprinkles: Step 1) Minerva skin patch is ready to go https://gerrit.wikimedia.org/r/358141 [17:10:17] Ok, so let's merge that and backport to wmf.9 [17:10:51] Oh, it didn't get branched as a stub extension, I thought we added to make-wmf-branch [17:12:18] PROBLEM - LVS HTTP IPv4 on wdqs.svc.codfw.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.073 second response time [17:12:23] PROBLEM - Check for valid instance states on labnodepool1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:12:43] <_joe_> can someone DOWNTIME that lvs check [17:12:56] <_joe_> it already paged me twice [17:13:08] (03CR) 10Faidon Liambotis: [V: 032 C: 032] icinga: move RIPE Atlas measurements under netops [puppet] - 10https://gerrit.wikimedia.org/r/364208 (owner: 10Faidon Liambotis) [17:13:10] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs2003.codfw.wmnet [17:13:12] RainbowSprinkles: let me know if you need a review/patch written [17:13:18] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs2002.codfw.wmnet [17:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:28] RECOVERY - LVS HTTP IPv4 on wdqs.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 13048 bytes in 0.073 second response time [17:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:30] SMalyshev: ^ those are now depooled, when you are done just ping me [17:13:31] (03PS4) 10Faidon Liambotis: icinga: move RIPE Atlas measurements under netops [puppet] - 10https://gerrit.wikimedia.org/r/364208 [17:13:33] and i'll repool [17:13:34] (03CR) 10Faidon Liambotis: [V: 032 C: 032] icinga: move RIPE Atlas measurements under netops [puppet] - 10https://gerrit.wikimedia.org/r/364208 (owner: 10Faidon Liambotis) [17:13:38] that fixes the alert that went out =] [17:13:58] jdlrobson: Actually, I need to add to wmf.7 and wmf.9 [17:14:02] pybal just doesnt automatically depool 2 of 3 hosts in a pool [17:14:07] 10Operations, 10Cassandra, 10Mobile-Content-Service, 10Reading-Infrastructure-Team-Backlog, 10Services (attic): mobileapps 500s following reboot of restbase1007 - https://phabricator.wikimedia.org/T138314#3431487 (10GWicke) This looks out of date. @mobrovac, time to close it? [17:14:10] So we'll want to merge to master, then create 2 branches [17:14:11] robh: thank you, will do. Probably will take a day or so to do full load [17:14:23] no worries, im on clinic duty all week =] [17:14:33] and this way no more oddball requests are routed to down servers [17:14:41] no clue how well the wikidata query service itself handles that [17:14:44] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time [17:15:11] (03PS4) 10Elukey: Clone wikistats v2 repository and link it to v2 [puppet] - 10https://gerrit.wikimedia.org/r/362118 (https://phabricator.wikimedia.org/T167684) (owner: 10Milimetric) [17:15:22] !log labcontrol1001:~# service rabbitmq-server restart [17:15:24] what's up with wdqs? 
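The "conftool action" SAL entries above are what confctl logs when backends are (de)pooled by hand; a sketch of the equivalent commands, with the select/set syntax inferred from the logged selectors:
```
# Depool the two wdqs backends being reimaged (selectors copied from the log):
confctl select 'name=wdqs2003.codfw.wmnet' set/pooled=no
confctl select 'name=wdqs2002.codfw.wmnet' set/pooled=no
# Repool once the data reload is done, as robh offers above:
confctl select 'name=wdqs2003.codfw.wmnet' set/pooled=yes
```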
[17:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:44] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 0.006 second response time [17:15:54] akosiaris_: SMalyshev is reimaging 2 of the 3 hosts in codfw [17:16:03] PROBLEM - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [140.0] [17:16:05] and they werent depooled in pybal ahead of time, so when they went down pybal complained [17:16:06] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Create "network" icinga group - https://phabricator.wikimedia.org/T167279#3431501 (10faidon) 05Open>03Resolved a:03faidon Done :) [17:16:11] since 2/3rds of the pool was gone [17:16:31] the two down systems are now depooled properly, and wdqs can handle the load on the 3 eqiad + 1 codfw (i checked with SMalyshev) [17:16:32] RainbowSprinkles: shall i hit +2 on the Minerva patch then? [17:16:38] Yes [17:16:42] ah ok, cool, thanks for the info [17:16:45] welcome =] [17:16:52] thanks for following up on being paged! [17:17:13] RainbowSprinkles: it's going through CI now [17:17:23] It'll take a while, CI's a little backed up [17:17:33] PROBLEM - LVS HTTP IPv4 on recommendation-api.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.37 and port 5252: Connection refused [17:17:58] <_joe_> uh? [17:18:02] someone already claimed the recommendation thing right? [17:18:03] PROBLEM - LVS HTTP IPv4 on recommendation-api.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.37 and port 5252: Connection refused [17:18:08] im trying to find in backscroll but its a LOT [17:18:10] <_joe_> why port 5252? [17:18:12] <_joe_> wtf? [17:18:21] <_joe_> ok I'm coming back online I guess [17:18:42] _joe_: you thought you could actually stand up and walk around huh? nope! =[ [17:19:05] 10Operations, 10ops-codfw, 10ops-eqiad, 10monitoring: Unresponsive/misconfigured iDRACs over the host-BMC interface - https://phabricator.wikimedia.org/T169360#3431516 (10faidon) [17:19:12] !log labnet1001:~# service nova-api restart [17:19:17] <_joe_> robh: this is actually my fault, meh [17:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:36] 10Operations, 10MediaWiki-General-or-Unknown, 10Traffic, 10Services (watching): Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#3431519 (10GWicke) [17:20:19] 10Operations, 10Performance-Team, 10Availability (Multiple-active-datacenters), 10Services (blocked): Consider REST with SSL (HyperSwitch/Cassandra) for session storage - https://phabricator.wikimedia.org/T134811#3431525 (10GWicke) [17:20:34] is jenkins/zuul piling up requests? https://integration.wikimedia.org/zuul/ [17:20:46] (03PS1) 10Giuseppe Lavagetto: lvs::configuration: fix port for icinga check on recommendation-api [puppet] - 10https://gerrit.wikimedia.org/r/364787 [17:21:37] elukey looks like that for me too. I see no tests running.
[17:21:41] <_joe_> !log rolling restart of pybal on low-traffic LVS in eqiad,codfw [17:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:54] elukey: we are looking into something maybe related [17:22:18] PROBLEM - WDQS HTTP on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.073 second response time [17:22:26] 10Blocked-on-Operations, 10Operations, 10Services, 10monitoring: Update restbase catchpoint metric - https://phabricator.wikimedia.org/T137181#3431537 (10GWicke) 05Open>03Resolved [17:22:51] 10Operations, 10monitoring: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#3431545 (10faidon) [17:22:55] 10Operations, 10ops-codfw, 10ops-eqiad, 10monitoring: Unresponsive/misconfigured iDRACs over the host-BMC interface - https://phabricator.wikimedia.org/T169360#3431540 (10faidon) 05Open>03Resolved a:03faidon So it seems like the remaining ones are: - labsdb100{1,3}: Ciscos, ignore (T142807) - mw1196... [17:23:02] 10Operations, 10Cassandra, 10RESTBase-Cassandra, 10Patch-For-Review, 10Services (attic): Evaluate Brotli compression for Cassandra - https://phabricator.wikimedia.org/T125906#3431546 (10GWicke) [17:23:08] PROBLEM - WDQS HTTP on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.072 second response time [17:23:55] hmm, why is icinga complaining now? it should be in maintenance [17:24:05] 10Operations, 10monitoring: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#2775707 (10faidon) 05Open>03Resolved All listed here and most of the T169360's are fixed now. What isn't fixed is due to hardware troubles that is tracked separately (and it's just 5 now, i... [17:24:08] PROBLEM - WDQS SPARQL on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.072 second response time [17:24:15] chasemp: maybe only a huge queue? https://integration.wikimedia.org/zuul/ [17:24:28] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - recommendation-api_9632 - Could not depool server scb1002.eqiad.wmnet because of too many down! [17:24:30] 10Operations, 10ArchCom-RfC, 10ArchCom-Has-shepherd, 10RfC, 10Services (attic): Service Ownership and Maintenance - https://phabricator.wikimedia.org/T122825#3431558 (10GWicke) [17:24:42] chasemp: there are jobs that have been queued for 45 mins :D [17:24:56] 10Operations, 10Security-General, 10Services (attic): Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#3431563 (10GWicke) [17:25:58] PROBLEM - PyBal backends health check on lvs1010 is CRITICAL: PYBAL CRITICAL - recommendation-api_9632 - Could not depool server scb1004.eqiad.wmnet because of too many down! [17:26:36] <_joe_> mobrovac: the bug makes the checks fail on pybal [17:27:35] 10Operations, 10Traffic, 10monitoring, 10Patch-For-Review: Performance impact evaluation of enabling nginx-lua and nginx-lua-prometheus on tlsproxy - https://phabricator.wikimedia.org/T161101#3121488 (10faidon) @Ema, it seems like the task as described has been completed (awesome work and great presentatio...
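The WDQS 503s are easy to reproduce outside icinga, which helps separate "backend really down" from the downtime/ack weirdness discussed below. A sketch with the stock Nagios plugin; the flags are illustrative and the real check command lives in puppet:

```
# Run the same HTTP probe icinga runs, against one backend:
/usr/lib/nagios/plugins/check_http -H wdqs2003.codfw.wmnet -p 80 -t 10
# HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.073 second response time
```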
[17:28:10] <_joe_> sigh [17:28:24] <_joe_> I'll have to ack the alerts everywhere [17:29:16] !log labnodepool1001:~# service nodepool stop [17:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:36] 10Operations, 10Performance-Team, 10User-Elukey, 10Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3431623 (10aaron) >>! In T125735#3421595, @elukey wrote: > @aaron any chance in your o... [17:30:38] (03PS2) 10Giuseppe Lavagetto: lvs::configuration: fix port for icinga check on recommendation-api [puppet] - 10https://gerrit.wikimedia.org/r/364787 [17:31:53] RainbowSprinkles: still waiting for the merge :) [17:32:18] Yeah, nodepool is busted [17:32:27] See #-releng / #-cloud [17:32:42] RainbowSprinkles: just saw your comment in releng [17:33:04] 10Operations, 10ops-eqiad, 10DC-Ops, 10Services (watching), 10User-fgiunchedi: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3431630 (10Cmjohnson) [17:33:32] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] lvs::configuration: fix port for icinga check on recommendation-api [puppet] - 10https://gerrit.wikimedia.org/r/364787 (owner: 10Giuseppe Lavagetto) [17:33:51] _joe_ need any help? [17:34:15] <_joe_> elukey: not for now, thanks [17:34:28] PROBLEM - nodepoold running on labnodepool1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [17:34:38] PROBLEM - Check systemd state on labnodepool1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:37:34] (03PS1) 10Giuseppe Lavagetto: pybal: remove proxyfetch from recommendation-api temporarily [puppet] - 10https://gerrit.wikimedia.org/r/364791 (https://phabricator.wikimedia.org/T170439) [17:38:23] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] pybal: remove proxyfetch from recommendation-api temporarily [puppet] - 10https://gerrit.wikimedia.org/r/364791 (https://phabricator.wikimedia.org/T170439) (owner: 10Giuseppe Lavagetto) [17:40:19] RECOVERY - PyBal backends health check on lvs1010 is OK: PYBAL OK - All pools are healthy [17:40:26] <_joe_> much better :P [17:40:53] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [17:42:23] PROBLEM - WDQS SPARQL on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.073 second response time [17:42:23] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /translation/articles/{source}/{target}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 404 (expecting: 200): /translation/articles/{source}/{target}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200): / [17:42:23] s/{source}/{target}{/seed} (bad source) is CRITICAL: Test bad source returned the unexpected status 404 (expecting: 504) [17:42:23] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /translation/articles/{source}/{target}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 404 (expecting: 200): /translation/articles/{source}/{target}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200): 
/ [17:42:23] s/{source}/{target}{/seed} (bad source) is CRITICAL: Test bad source returned the unexpected status 404 (expecting: 504) [17:43:03] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [17:43:12] robh: icinga keeps dropping my scheduled downtime for wdqs2002/2003 - what's up with that? [17:43:27] I enter it and in 10 minutes there's no sign of it [17:43:42] PROBLEM - WDQS HTTP on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.072 second response time [17:44:17] <_joe_> uhm I did ack those alerts [17:44:20] <_joe_> grr [17:44:43] PROBLEM - WDQS SPARQL on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.074 second response time [17:44:48] (03PS1) 10Cmjohnson: Adding production dns for restbase-dev100[4-6] T166181 [dns] - 10https://gerrit.wikimedia.org/r/364792 [17:45:02] RECOVERY - Check systemd state on labnodepool1001 is OK: OK - running: The system is fully operational [17:45:33] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /translation/articles/{source}/{target}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 404 (expecting: 200): /translation/articles/{source}/{target}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200): / [17:45:33] s/{source}/{target}{/seed} (bad source) is CRITICAL: Test bad source returned the unexpected status 404 (expecting: 504) [17:45:56] (03CR) 10Cmjohnson: [C: 032] Adding production dns for restbase-dev100[4-6] T166181 [dns] - 10https://gerrit.wikimedia.org/r/364792 (owner: 10Cmjohnson) [17:46:02] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [17:46:52] RECOVERY - nodepoold running on labnodepool1001 is OK: PROCS OK: 1 process with UID = 113 (nodepool), regex args ^/usr/bin/python /usr/bin/nodepoold -d [17:46:53] ACKNOWLEDGEMENT - recommendation_api endpoints health on scb2002 is CRITICAL: /translation/articles/{source}/{target}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 404 (expecting: 200): /translation/articles/{source}/{target}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: [17:46:53] /articles/{source}/{target}{/seed} (bad source) is CRITICAL: Test bad source returned the unexpected status 404 (expecting: 504) Giuseppe Lavagetto flapping, already working on a fix. T170439 [17:46:53] ACKNOWLEDGEMENT - recommendation_api endpoints health on scb2003 is CRITICAL: /translation/articles/{source}/{target}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 404 (expecting: 200): /translation/articles/{source}/{target}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: [17:46:53] /articles/{source}/{target}{/seed} (bad source) is CRITICAL: Test bad source returned the unexpected status 404 (expecting: 504) Giuseppe Lavagetto flapping, already working on a fix. 
T170439 [17:46:53] ACKNOWLEDGEMENT - recommendation_api endpoints health on scb2005 is CRITICAL: /translation/articles/{source}/{target}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 404 (expecting: 200): /translation/articles/{source}/{target}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: [17:46:54] /articles/{source}/{target}{/seed} (bad source) is CRITICAL: Test bad source returned the unexpected status 404 (expecting: 504) Giuseppe Lavagetto flapping, already working on a fix. T170439 [17:47:04] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3431708 (10Cmjohnson) [17:47:39] _joe_: do you know maybe why icinga is dropping my downtime schedule? [17:47:58] <_joe_> SMalyshev: it's a known bug we're experiencing lately [17:48:10] <_joe_> SMalyshev: multiple people are looking into it [17:48:12] ah I see. Any workarounds? [17:48:12] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[r-cran-rmysql] [17:48:13] 10Operations, 10ops-eqiad: mgmt inaccessible on restbase1018 - https://phabricator.wikimedia.org/T169871#3431713 (10Cmjohnson) 05Open>03Resolved Resolved this with the system reboot [17:48:25] <_joe_> nope [17:48:35] ok, I'll just ignore it then, thanks! [17:48:36] * _joe_ off [17:49:02] PROBLEM - salt-minion processes on stat1006 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [17:49:02] PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
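For context on the disappearing downtimes: a scheduled downtime is, under the hood, just an external command written to the Icinga command pipe on the monitoring host. A minimal sketch assuming stock Icinga 1.x command names and Debian paths, with an illustrative two-hour window:

```
# Schedule a fixed 2h downtime for all services on wdqs2002:
now=$(date +%s); end=$((now + 7200))
printf '[%d] SCHEDULE_HOST_SVC_DOWNTIME;wdqs2002;%d;%d;1;0;7200;smalyshev;reimage and data reload\n' \
  "$now" "$now" "$end" > /var/lib/icinga/rw/icinga.cmd
```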
[17:49:02] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /translation/articles/{source}/{target}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 404 (expecting: 200): /translation/articles/{source}/{target}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200): / [17:49:02] s/{source}/{target}{/seed} (bad source) is CRITICAL: Test bad source returned the unexpected status 404 (expecting: 504) [17:51:52] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /translation/articles/{source}/{target}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 404 (expecting: 200): /translation/articles/{source}/{target}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200): / [17:51:54] s/{source}/{target}{/seed} (bad source) is CRITICAL: Test bad source returned the unexpected status 404 (expecting: 504) [17:52:26] !log labnodepool1001:~# sudo puppet agent --enable [17:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:43] PROBLEM - WDQS HTTP on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.073 second response time [17:53:00] to be clear: CI is back running jobs again after the Cloud fix [17:53:27] ACKNOWLEDGEMENT - recommendation_api endpoints health on scb2004 is CRITICAL: /translation/articles/{source}/{target}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 404 (expecting: 200): /translation/articles/{source}/{target}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: [17:53:27] /articles/{source}/{target}{/seed} (bad source) is CRITICAL: Test bad source returned the unexpected status 404 (expecting: 504) Giuseppe Lavagetto T170439 [17:53:27] ACKNOWLEDGEMENT - recommendation_api endpoints health on scb2006 is CRITICAL: /translation/articles/{source}/{target}{/seed} (normal source and target) is CRITICAL: Test normal source and target returned the unexpected status 404 (expecting: 200): /translation/articles/{source}/{target}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: [17:53:27] /articles/{source}/{target}{/seed} (bad source) is CRITICAL: Test bad source returned the unexpected status 404 (expecting: 504) Giuseppe Lavagetto T170439 [17:53:32] something something AZs [17:54:20] 10Operations, 10Mobile-Content-Service, 10Parsing-Team, 10Reading-Infrastructure-Team-Backlog, and 4 others: Create functional cluster checks for all services (and have them page!) - https://phabricator.wikimedia.org/T134551#3431762 (10GWicke) @Joe, has this been resolved by the external checks (including... [17:55:02] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [17:57:55] 10Operations, 10Citoid, 10Services, 10VisualEditor, 10User-Ryasmeen: NIH db misbehaviour causing problems to Citoid - https://phabricator.wikimedia.org/T133696#3431772 (10GWicke) 05stalled>03Resolved a:03GWicke It sounds like there is nothing left to do here. 
NIH db outages will always affect citoi... [17:58:00] (03PS1) 10Cmjohnson: Adding production and mgmt ip address for labnet1003/1004 T165779 [dns] - 10https://gerrit.wikimedia.org/r/364795 [17:58:38] (03CR) 10Cmjohnson: [C: 032] Adding production and mgmt ip address for labnet1003/1004 T165779 [dns] - 10https://gerrit.wikimedia.org/r/364795 (owner: 10Cmjohnson) [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170712T1800). Please do the needful. [18:00:05] Urbanecm: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:12] Present [18:01:12] (03CR) 10Chad: [C: 032] Fix logos for srwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364754 (https://phabricator.wikimedia.org/T168444) (owner: 10Urbanecm) [18:01:29] Urbanecm: It's gonna take a while to merge [18:01:31] CI was backed up [18:02:07] RainbowSprinkles: ack [18:05:16] greg-g: fwiw work was happening to create a second AZ (logical thing) and suddenly openstack made it default and that was bad [18:05:22] sorted by random ID seems like :) [18:05:34] and none of our $things explicitly select the "default" now [18:05:48] chasemp: heh, love random sort [18:06:53] (03PS1) 10Andrew Bogott: Add AAAA for labtestpuppetmaster2001 [dns] - 10https://gerrit.wikimedia.org/r/364798 [18:07:17] 10Operations, 10Icinga, 10Services, 10monitoring: create service/user groups in icinga - https://phabricator.wikimedia.org/T107884#3431856 (10GWicke) 05Open>03Resolved a:03GWicke [18:07:20] 10Operations, 10Ops-Access-Requests, 10Icinga, 10Services, 10monitoring: give services team permissions to send commands in icinga - https://phabricator.wikimedia.org/T105228#3431858 (10GWicke) [18:07:32] (03CR) 10Andrew Bogott: "Do we need a reverse record someplace too?" [dns] - 10https://gerrit.wikimedia.org/r/364798 (owner: 10Andrew Bogott) [18:11:22] RECOVERY - salt-minion processes on ms-be1034 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [18:12:02] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [18:12:38] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labnet100[34] - https://phabricator.wikimedia.org/T165779#3431876 (10Cmjohnson) [18:13:03] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labnet100[34] - https://phabricator.wikimedia.org/T165779#3276633 (10Cmjohnson) @robh can you take this from here please.
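Unpacking chasemp's availability-zone point: once a second AZ existed, the scheduler started picking a zone effectively at random because nothing requested one explicitly. A hedged illustration of the explicit form with the OpenStack CLI (zone, flavor, image and instance names are all assumptions):

```
# Pin a new instance to a specific availability zone instead of
# trusting the scheduler's default choice:
openstack server create \
  --availability-zone nova \
  --flavor m1.medium \
  --image debian-jessie \
  ci-worker-example
```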
[18:13:09] (03CR) 10Chad: [V: 032 C: 032] Fix logos for srwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364754 (https://phabricator.wikimedia.org/T168444) (owner: 10Urbanecm) [18:13:26] (03CR) 10jenkins-bot: Fix logos for srwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364754 (https://phabricator.wikimedia.org/T168444) (owner: 10Urbanecm) [18:13:34] (03PS1) 10Andrew Bogott: Nova: Depool labvirt1015 for now [puppet] - 10https://gerrit.wikimedia.org/r/364801 [18:13:43] 10Operations, 10Goal, 10Kubernetes: Operations Q1 goal: Streamlined Service Delivery - https://phabricator.wikimedia.org/T170108#3431905 (10GWicke) [18:13:59] 10Operations, 10Goal, 10Kubernetes: Operations Q1 goal: Streamlined Service Delivery - https://phabricator.wikimedia.org/T170108#3419676 (10GWicke) [18:14:03] (03CR) 10Andrew Bogott: [V: 032 C: 032] Nova: Depool labvirt1015 for now [puppet] - 10https://gerrit.wikimedia.org/r/364801 (owner: 10Andrew Bogott) [18:14:07] (03PS2) 10Andrew Bogott: Nova: Depool labvirt1015 for now [puppet] - 10https://gerrit.wikimedia.org/r/364801 [18:14:12] (03CR) 10Andrew Bogott: [V: 032 C: 032] Nova: Depool labvirt1015 for now [puppet] - 10https://gerrit.wikimedia.org/r/364801 (owner: 10Andrew Bogott) [18:14:23] !log demon@tin Synchronized static/images/project-logos/: Fixing srwikiquote logos (duration: 00m 48s) [18:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:38] Urbanecm: Forced yours through, you're live now ^^ [18:14:45] 10Operations, 10Deployment-Systems, 10Services (attic): Evaluate Docker as a container deployment tool - https://phabricator.wikimedia.org/T93439#3431909 (10GWicke) We are working on this as part of {T170453}. [18:14:46] RainbowSprinkles: Thank you! [18:15:08] !log depooling labvirt1015, deleting a bunch of stuck contintcloud instances [18:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:20] RainbowSprinkles: I still get the old ones [18:15:20] 10Operations, 10Release-Engineering-Team, 10Services (watching): 2017/18 Program 6: Streamlined Service delivery - https://phabricator.wikimedia.org/T170453#3431889 (10GWicke) [18:15:27] Prolly cached? [18:15:27] greg-g: seems like things are still odd or bad, andrewbogott is looking now [18:15:33] 10Operations, 10Release-Engineering-Team, 10Services (watching): 2017/18 Program 6: Streamlined Service delivery - https://phabricator.wikimedia.org/T170453#3431889 (10GWicke) p:05Triage>03Normal [18:15:37] !log mobrovac@tin Started deploy [recommendation-api/deploy@7fd10f2]: Use the domain parameter as the target language - T170439 [18:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:48] T170439: Recommendation API needs to take a domain as a parameter - https://phabricator.wikimedia.org/T170439 [18:15:54] RainbowSprinkles: If so, server-side cached. 
[18:16:01] 10Operations, 10Release-Engineering-Team, 10Services (watching): FY2017/18 Program 6: Streamlined Service delivery - https://phabricator.wikimedia.org/T170453#3431917 (10GWicke) [18:16:02] Yeah, varnish [18:16:05] Lemme purge the urls [18:16:17] !log mobrovac@tin Finished deploy [recommendation-api/deploy@7fd10f2]: Use the domain parameter as the target language - T170439 (duration: 00m 40s) [18:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:29] thx [18:17:18] Purged all 3 from varnish [18:18:12] chasemp: :( nodepool is back to not OK :( [18:18:48] RainbowSprinkles: It still seems like old files. [18:19:36] Well I promise it merged and synced. It's gotta be stuck somewhere :) [18:21:03] : [18:21:05] :) [18:21:22] (03PS1) 10Andrew Bogott: Nova: Depool labvirt1014 as well. [puppet] - 10https://gerrit.wikimedia.org/r/364805 [18:21:54] (03CR) 10Andrew Bogott: [V: 032 C: 032] Nova: Depool labvirt1014 as well. [puppet] - 10https://gerrit.wikimedia.org/r/364805 (owner: 10Andrew Bogott) [18:23:27] RainbowSprinkles: And client's cache shouldn't cause that, see https://pastebin.com/fPqBtXBC [18:24:52] (03PS1) 10Ottomata: Move more stuff into profile::statistics::private [puppet] - 10https://gerrit.wikimedia.org/r/364806 (https://phabricator.wikimedia.org/T152712) [18:25:22] Hmm, same [18:25:29] (03PS2) 10Ottomata: Move more stuff into profile::statistics::private [puppet] - 10https://gerrit.wikimedia.org/r/364806 (https://phabricator.wikimedia.org/T152712) [18:26:23] (03PS3) 10Ottomata: Move more stuff into profile::statistics::private [puppet] - 10https://gerrit.wikimedia.org/r/364806 (https://phabricator.wikimedia.org/T152712) [18:26:28] The image served by varnish and the one I've submitted are simply different. Do you know why? [18:28:36] (03PS4) 10Ottomata: Move more stuff into profile::statistics::private [puppet] - 10https://gerrit.wikimedia.org/r/364806 (https://phabricator.wikimedia.org/T152712) [18:29:04] RainbowSprinkles: ^^ [18:29:57] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labnet100[34] - https://phabricator.wikimedia.org/T165779#3431947 (10RobH) a:05Cmjohnson>03RobH Yep! [18:30:11] SMalyshev: sorry, was afk for a bit [18:30:11] (03PS1) 10Cmjohnson: Adding production and dns entries for labnodepool1002 T168407 [dns] - 10https://gerrit.wikimedia.org/r/364807 [18:30:24] so there is a known issue where icinga acks go away and we (operations) are still trying to find out why [18:30:29] robh: no problem, just icinga misbehaving :) [18:30:30] (03CR) 10jerkins-bot: [V: 04-1] Adding production and dns entries for labnodepool1002 T168407 [dns] - 10https://gerrit.wikimedia.org/r/364807 (owner: 10Cmjohnson) [18:30:49] (03CR) 10Ottomata: [V: 032 C: 032] "no op https://puppet-compiler.wmflabs.org/7041/" [puppet] - 10https://gerrit.wikimedia.org/r/364806 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [18:30:51] I'm just ignoring it for now :) [18:31:50] Urbanecm: I do not know.
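On the "purged all 3 from varnish" step: one common server-side way to do this is MediaWiki's purgeList.php maintenance script, which reads URLs on stdin and emits the corresponding cache purges. A sketch; the URL is illustrative, assembled from the static/images/project-logos/ path that was synced:

```
# Purge one of the stale logo URLs from the caching layer:
echo 'https://sr.wikiquote.org/static/images/project-logos/srwikiquote.png' \
  | mwscript purgeList.php --wiki=srwikiquote
```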
[18:31:54] I merged what you submitted [18:31:56] And sync'd it [18:31:59] Nothing special [18:32:22] (03PS2) 10Cmjohnson: Adding production and dns entries for labnodepool1002 T168407 [dns] - 10https://gerrit.wikimedia.org/r/364807 [18:33:39] Hmm [18:37:40] 10Operations, 10Release-Engineering-Team, 10Epic, 10Services (doing): Streamlined Service delivery Objective 3: Integrated, container-based development environment - https://phabricator.wikimedia.org/T170456#3431976 (10GWicke) [18:38:06] (03PS1) 10Ottomata: Apply role statistics::private_new to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/364808 (https://phabricator.wikimedia.org/T152712) [18:38:10] 10Operations, 10Release-Engineering-Team, 10Epic, 10Services (watching): FY2017/18 Program 6: Streamlined Service delivery - https://phabricator.wikimedia.org/T170453#3431889 (10GWicke) [18:38:56] 10Operations, 10Release-Engineering-Team, 10Epic, 10Services (doing): Streamlined Service delivery Objective 3: Integrated, container-based development environment - https://phabricator.wikimedia.org/T170456#3431976 (10GWicke) [18:39:16] 10Operations, 10Release-Engineering-Team, 10Epic, 10Services (doing): Streamlined Service delivery Objective 3: Integrated, container-based development environment - https://phabricator.wikimedia.org/T170456#3431976 (10GWicke) [18:39:44] 10Operations, 10Release-Engineering-Team, 10Epic, 10Services (doing): Streamlined Service delivery Objective 3: Integrated, container-based development environment - https://phabricator.wikimedia.org/T170456#3431996 (10GWicke) [18:40:15] 10Operations, 10Release-Engineering-Team, 10Epic, 10Services (doing): Streamlined Service delivery Objective 3: Integrated, container-based development environment - https://phabricator.wikimedia.org/T170456#3431976 (10GWicke) [18:40:30] (03CR) 10Cmjohnson: [C: 032] Adding production and dns entries for labnodepool1002 T168407 [dns] - 10https://gerrit.wikimedia.org/r/364807 (owner: 10Cmjohnson) [18:41:04] (03PS2) 10Ottomata: Apply role statistics::private_new to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/364808 (https://phabricator.wikimedia.org/T152712) [18:42:27] 10Operations, 10MediaWiki-Containers, 10Release-Engineering-Team, 10Epic, and 3 others: Streamlined Service delivery Objective 3: Integrated, container-based development environment - https://phabricator.wikimedia.org/T170456#3432001 (10mobrovac) [18:42:32] (03PS3) 10Ottomata: Apply role statistics::private_new to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/364808 (https://phabricator.wikimedia.org/T152712) [18:42:45] (03CR) 10Ottomata: "no op on stat1002" [puppet] - 10https://gerrit.wikimedia.org/r/364808 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [18:42:48] (03CR) 10Ottomata: [V: 032 C: 032] Apply role statistics::private_new to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/364808 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [18:46:53] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labnodepool1002.eqiad.wmnet - https://phabricator.wikimedia.org/T168407#3432010 (10Cmjohnson) [18:47:21] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labnodepool1002.eqiad.wmnet - https://phabricator.wikimedia.org/T168407#3363615 (10Cmjohnson) a:05Cmjohnson>03RobH Moving this @robh to handle the off-site work. 
[18:47:52] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 8 [18:48:52] PROBLEM - Disk space on stat1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:48:53] PROBLEM - DPKG on stat1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:49:42] RECOVERY - Disk space on stat1005 is OK: DISK OK [18:49:47] 10Operations, 10Release-Engineering-Team, 10Epic, 10Services (watching): FY2017/18 Program 6: Streamlined Service delivery - https://phabricator.wikimedia.org/T170453#3432015 (10greg) [18:49:52] RECOVERY - DPKG on stat1005 is OK: All packages OK [18:50:12] 10Operations, 10MediaWiki-Containers, 10Release-Engineering-Team, 10Epic, and 3 others: Streamlined Service Delivery: Outcome 2, Objective 3: Integrated, container-based development environment - https://phabricator.wikimedia.org/T170456#3432016 (10greg) [18:50:24] (03CR) 10Mforns: "LGTM! Thanks for taking care of this :]" [puppet] - 10https://gerrit.wikimedia.org/r/364743 (owner: 10Elukey) [18:51:02] RECOVERY - MariaDB Slave Lag: s3 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89954.51 seconds [18:51:21] (03PS1) 10Ottomata: Add groups to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/364812 (https://phabricator.wikimedia.org/T152712) [18:51:37] (03CR) 10Ottomata: [V: 032 C: 032] Add groups to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/364812 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [18:52:00] (03CR) 10Mforns: "LGTM! Sorry for being slow, and reviewing only post-merge" [puppet] - 10https://gerrit.wikimedia.org/r/364701 (owner: 10Elukey) [18:52:12] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:53:30] (03PS1) 10Ottomata: Remove statistics-privatedata-users from stat1006.yaml (cruncher) [puppet] - 10https://gerrit.wikimedia.org/r/364813 (https://phabricator.wikimedia.org/T152712) [18:53:42] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/mysql/conf.d/statistics-private-client.cnf] [18:53:43] (03CR) 10Ottomata: [V: 032 C: 032] Remove statistics-privatedata-users from stat1006.yaml (cruncher) [puppet] - 10https://gerrit.wikimedia.org/r/364813 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [18:54:52] RECOVERY - salt-minion processes on stat1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:55:22] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [18:55:42] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [18:55:45] (03PS1) 10Ottomata: AH, yes, statistics-privatedata-users should be on cruncher, it is a superset of perms [puppet] - 10https://gerrit.wikimedia.org/r/364814 (https://phabricator.wikimedia.org/T152712) [18:55:51] (03CR) 10Ottomata: [V: 032 C: 032] AH, yes, statistics-privatedata-users should be on cruncher, it is a superset of perms [puppet] - 10https://gerrit.wikimedia.org/r/364814 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [18:58:02] PROBLEM - salt-minion processes on stat1006 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [18:59:31] (03PS1) 10Ottomata: Use conditionals instead of new role files to deal with stat box migration [puppet] - 10https://gerrit.wikimedia.org/r/364817 (https://phabricator.wikimedia.org/T152712) [19:00:04] thcipriani: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170712T1900). Please do the needful. [19:00:58] * thcipriani acks [19:03:51] (03PS1) 10Cmjohnson: adding mgmt ip address for labweb1001/1002 T167820 [dns] - 10https://gerrit.wikimedia.org/r/364820 [19:04:31] (03CR) 10Cmjohnson: [C: 032] adding mgmt ip address for labweb1001/1002 T167820 [dns] - 10https://gerrit.wikimedia.org/r/364820 (owner: 10Cmjohnson) [19:05:29] (03CR) 10Ottomata: "Total no op https://puppet-compiler.wmflabs.org/7045/" [puppet] - 10https://gerrit.wikimedia.org/r/364817 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [19:05:37] (03CR) 10Ottomata: [V: 032 C: 032] Use conditionals instead of new role files to deal with stat box migration [puppet] - 10https://gerrit.wikimedia.org/r/364817 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [19:07:13] (03PS1) 10Ottomata: Run reportupdater::jobs::hadoop from stat1005 instead of stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/364823 (https://phabricator.wikimedia.org/T152712) [19:07:35] (03CR) 10Ottomata: [V: 032 C: 032] Run reportupdater::jobs::hadoop from stat1005 instead of stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/364823 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [19:09:45] !log demon@tin Synchronized php-1.30.0-wmf.7/skins/MinervaNeue/: Latest code (duration: 00m 48s) [19:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:40] !log demon@tin Synchronized php-1.30.0-wmf.9/skins/MinervaNeue/: Latest code (duration: 00m 47s) [19:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:25] (03PS1) 10Ottomata: Move refinery::job::data_check to stat1005 from stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/364826 (https://phabricator.wikimedia.org/T152712) [19:14:33] (03CR) 10Chad: [C: 031] "Can we just merge this thing already?" 
[puppet] - 10https://gerrit.wikimedia.org/r/255958 (owner: 10Reedy) [19:14:42] (03CR) 10Ottomata: [V: 032 C: 032] Move refinery::job::data_check to stat1005 from stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/364826 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [19:14:46] (03PS1) 10Herron: Add clamav to lists for malware scanning [puppet] - 10https://gerrit.wikimedia.org/r/364827 (https://phabricator.wikimedia.org/T170462) [19:18:16] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labweb100[12].wikimedia.org - https://phabricator.wikimedia.org/T167820#3432194 (10Cmjohnson) [19:19:06] 10Operations, 10MediaWiki-Platform-Team, 10Performance-Team, 10Availability (Multiple-active-datacenters), and 5 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3432209 (10Krinkle) [19:19:21] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labweb100[12].wikimedia.org - https://phabricator.wikimedia.org/T167820#3345535 (10Cmjohnson) a:05Cmjohnson>03RobH @robh assigning to you...can you do the production DNS please in addition to the other things. Thanks [19:19:55] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3432235 (10RobH) a:05Cmjohnson>03Eevans I've asked in #wikimedia-services, but IRC is not a permanent medium, so I'll also ask here. @eevans alread... [19:20:20] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Cloud-VPS: rack/setup/install labpuppetmaster100[12].wikimedia.org - https://phabricator.wikimedia.org/T167905#3432245 (10Cmjohnson) a:05Cmjohnson>03RobH Assigning to robh to do off-site work [19:20:53] 10Operations, 10User-Elukey, 10User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#3432251 (10Cmjohnson) [19:21:21] 10Operations, 10Cloud-Services: rack/setup/install labcontrol100[34] - https://phabricator.wikimedia.org/T165781#3432252 (10Cmjohnson) [19:22:29] 10Operations: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3432264 (10Cmjohnson) [19:23:11] (03PS1) 10Thcipriani: group1 wikis to 1.30.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364828 [19:23:15] (03CR) 10Thcipriani: [C: 032] group1 wikis to 1.30.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364828 (owner: 10Thcipriani) [19:24:08] (03CR) 10Herron: "Compiler results: https://puppet-compiler.wmflabs.org/7046/" [puppet] - 10https://gerrit.wikimedia.org/r/364827 (https://phabricator.wikimedia.org/T170462) (owner: 10Herron) [19:24:36] 10Operations, 10Release-Engineering-Team (Backlog), 10Services (later), 10Wikimedia-Incident: Review new service 'pre-deployment to production' checklist - https://phabricator.wikimedia.org/T141897#3432284 (10GWicke) [19:24:50] (03Merged) 10jenkins-bot: group1 wikis to 1.30.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364828 (owner: 10Thcipriani) [19:25:35] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.30.0-wmf.9 [19:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:06] (03CR) 10jenkins-bot: group1 wikis to 1.30.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364828 (owner: 10Thcipriani) [19:27:30] 10Operations, 10ArchCom-RfC, 10Performance-Team, 10Traffic, and 5 others: RFC: API-driven web front-end - 
https://phabricator.wikimedia.org/T111588#3432296 (10GWicke) [19:27:39] !log thcipriani@tin Synchronized php: promote php symlink group1 wikis to 1.30.0-wmf.9 (duration: 00m 45s) [19:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:29] 10Operations, 10ops-eqiad, 10Performance-Team, 10Thumbor, 10User-fgiunchedi: Rename mw1236 / mw1237 to thumbor1003 / thumbor1004 - https://phabricator.wikimedia.org/T168297#3432304 (10Cmjohnson) 05Open>03Resolved done, racktables updated [19:28:42] 10Operations, 10RESTBase, 10Services (attic): RESTBase and domain renames - https://phabricator.wikimedia.org/T113307#3432307 (10GWicke) [19:30:38] 10Operations, 10Mathoid, 10Salt, 10Trebuchet, 10Services (attic): Remove sca100x from the list of Mathoid's minions - https://phabricator.wikimedia.org/T129645#3432312 (10GWicke) This sounds like it might be resolved. @arielglenn @mobrovac, could you verify? [19:31:48] 10Operations, 10Mathoid, 10Salt, 10Trebuchet, 10Services (done): Remove sca100x from the list of Mathoid's minions - https://phabricator.wikimedia.org/T129645#3432314 (10mobrovac) 05Open>03Resolved Yup, this hasn't been a problem for a while, resolving. [19:33:16] 10Operations, 10ops-eqiad, 10Services (watching): restbase-dev1003 stuck after reboot - https://phabricator.wikimedia.org/T169696#3432331 (10Cmjohnson) @Eevans restbase-dev1003 appears to be up on my end and mgmt is accessible...output from console Debian GNU/Linux 8 restbase-dev1003 ttyS1 restbase-dev100... [19:37:03] !log adding ignore-l3-incompletes to all peering/transit interfaces - T163542 [19:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:15] T163542: Interface errors on cr2-eqiad:xe-4/3/1 - https://phabricator.wikimedia.org/T163542 [19:39:29] (03PS1) 10Ottomata: Rsync MW API logs to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/364829 (https://phabricator.wikimedia.org/T152712) [19:40:21] (03CR) 10Ottomata: [V: 032 C: 032] Rsync MW API logs to stat1005 [puppet] - 10https://gerrit.wikimedia.org/r/364829 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [19:42:44] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [19:48:12] 10Operations, 10Traffic, 10Performance, 10Services (later): Look into solutions for replaying traffic to testing environment(s) - https://phabricator.wikimedia.org/T129682#3432378 (10GWicke) [19:49:04] 10Operations, 10netops: Interface errors on cr2-eqiad:xe-4/3/1 - https://phabricator.wikimedia.org/T163542#3432382 (10ayounsi) 05Open>03Resolved Did some more troubleshooting on that interface and some others showing regular l3 incomplete. I managed to capture packets coming from various providers, toward... [19:49:18] 10Puppet, 10Beta-Cluster-Infrastructure, 10Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#3432385 (10Nuria) [19:49:48] 10Operations, 10Pybal, 10Traffic, 10netops: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3432388 (10ayounsi) p:05Normal>03Triage [19:50:06] 10Operations, 10Pybal, 10Traffic, 10netops: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3205379 (10ayounsi) p:05Triage>03Normal [19:50:11] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3432393 (10Eevans) >>! 
In T166181#3432235, @RobH wrote: > I've asked in #wikimedia-services, but IRC is not a permanent medium, so I'll also ask here.... [19:52:14] PROBLEM - MariaDB Slave Lag: s2 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 331.34 seconds [19:54:36] Hi thcipriani. Did the train run today instead of yesterday? [19:55:20] PROBLEM - mysqld processes on db1102 is CRITICAL: PROCS CRITICAL: 3 processes with command name mysqld [19:55:22] Niharika: ran both days -- group0 yesterday, group1 today: https://tools.wmflabs.org/versions/ [19:55:37] thcipriani: Ah, okay. Thanks. [19:55:51] yw :) [19:55:55] Another downtime lost :( [19:56:18] ah so not a real issue? [19:56:28] (got the page) [19:56:47] nope [19:56:58] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Current state and next steps for RESTBase storage - https://phabricator.wikimedia.org/T152724#3432434 (10GWicke) [19:57:15] Why did it page, actually, if according to icinga alerts are disabled? [19:57:54] I dunno but it was db1102 and it whined [19:58:20] PROBLEM - MariaDB Slave Lag: s7 on db1041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.60 seconds [19:58:22] 10Operations, 10RESTBase, 10service-runner, 10Patch-For-Review, and 2 others: enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#3432439 (10GWicke) @mobrovac, is this done? [19:58:36] Definitely downtimes are getting lost [19:58:45] db1041 paged also [19:58:49] because db1041 has been disabled for a looong time [19:58:51] <_joe_> yes [19:58:53] <_joe_> sigh [19:59:08] :( [19:59:12] <_joe_> we really ought to find out wth is going on [19:59:28] it is happening a lot more lately [19:59:52] does seem like it [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170712T2000). [20:00:20] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3432454 (10Marostegui) And happened again now: ``` 19:55 < icinga-wm> PROBLEM - mysqld processes on db1102 is CRITICAL: PROCS CRITICAL: 3 proc... [20:00:56] "notifications disabled" but still paging.. if really true that is different from forgetting downtimes [20:01:01] could the pages be from one icinga server while the downtimes are scheduled on the other icinga server [20:01:33] mutante: yeah, that is what I don't get from db1102.... [20:01:54] Because db1041 looks like a downtime lost and db1102 is disabled and still paged [20:02:40] PROBLEM - MariaDB Slave Lag: s7 on db1041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.00 seconds [20:02:51] confirmed, notifications are still disabled for db1102... [20:02:57] this is new / different [20:03:03] downtimed db1041 again [20:03:11] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3432484 (10Cmjohnson) @jcrespo please remind how you would like the raid setup..Raid10? [20:03:15] no wait, it's not [20:03:27] notifications are just disabled for the host, but not for the services on the host [20:03:31] wow so db1041 just lost its downtime marostegui you put in?
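mutante's closing observation is the key subtlety here: in stock Icinga 1.x, host notifications and service notifications are separate switches, toggled by separate external commands, so "notifications disabled" on a host object says nothing about its services. A sketch of silencing both (command names per stock Icinga 1.x; paths per Debian):

```
# Silence both the host object and all of its services:
now=$(date +%s)
{
  printf '[%d] DISABLE_HOST_NOTIFICATIONS;db1102\n' "$now"
  printf '[%d] DISABLE_HOST_SVC_NOTIFICATIONS;db1102\n' "$now"
} > /var/lib/icinga/rw/icinga.cmd
```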
[20:03:34] so it works normal for that one [20:04:10] godog: No, that host has been downtimed for a long time already [20:04:41] do we know the expiry for these downtimes? maybe there's a pattern [20:04:42] was the downtime for all services on the host? [20:05:30] mutante: on icinga, on the critical list, it looks like the service is disabled too [20:06:12] But on the general one it doesn't: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=db1102&service=mysqld+processes [20:06:23] But now I don't know if that was like that or something else was lost... [20:06:40] marostegui: ugh, but not here https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=db1102 [20:06:43] 10Operations, 10ops-eqiad, 10Services (watching): restbase-dev1003 stuck after reboot - https://phabricator.wikimedia.org/T169696#3432516 (10RobH) 05Open>03Resolved Looks like this was still offline via networking, so it required a full power drain by onsite and reboot. Its now back. [20:07:06] Yeah...it is weird [20:07:17] I have now disabled it [20:07:19] 10Operations, 10ops-eqiad, 10Services (watching): restbase-dev1003 stuck after reboot - https://phabricator.wikimedia.org/T169696#3432518 (10Cmjohnson) The NIC did not come back after your reboot...I had to power off and reset it. It's backup again. This server is old and worn out and only has 1 good powe... [20:07:24] I think it was disabled before, but I cannot be sure now [20:07:34] Anyways, as this is "solved" I am going back to the sofa :) [20:08:05] heh we'll see if it pages again, thanks mutante marostegui ! [20:08:33] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3432538 (10Marostegui) Yep, RAID10 please: https://wikitech.wikimedia.org/wiki/Raid_and_MegaCli#Raid_setup_at_Wikimedia Thanks! [20:08:50] RECOVERY - MariaDB Slave Lag: s7 on db1041 is OK: OK slave_sql_lag Replication lag: 0.15 seconds [20:08:55] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T169448#3432540 (10Cmjohnson) New disks need to be ordered . A task has been created and escalated to @faidon T170446 [20:09:03] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3432545 (10RobH) The rsync should technically be easier, so I'd like to try it out first. Can you detail what data has to be backed up from the hosts,... [20:09:22] page for the recovery for 1041... [20:09:47] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T169448#3432547 (10Marostegui) Thanks a lot @Cmjohnson [20:10:38] 10Operations, 10Cassandra, 10Patch-For-Review, 10Services (blocked): setup/install restbase-dev100[123] - https://phabricator.wikimedia.org/T151075#3432551 (10Cmjohnson) [20:10:41] 10Operations, 10ops-eqiad, 10Cassandra, 10Services (blocked): restbase-test100[13] lost power redundancy - https://phabricator.wikimedia.org/T153248#3432549 (10Cmjohnson) 05Open>03declined not fixing this...new servers are here as replacements.
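godog's question about downtime expiries can be answered from Icinga's own state: every active downtime sits in status.dat with explicit start and end timestamps, so a pattern in the lost ones should be visible there. A hedged one-liner, assuming the stock Icinga 1.x file layout:

```
# Dump every downtime icinga currently knows about, with its window:
awk '/hostdowntime {/,/}/ {print} /servicedowntime {/,/}/ {print}' /var/lib/icinga/status.dat \
  | grep -E 'host_name|start_time|end_time'
```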
[20:11:00] !log bsitzmann@tin Started deploy [mobileapps/deploy@3f90bf1]: Update mobileapps to d30dae2 (T169930, T170225) [20:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:13] T169930: MCS requests content from RESTBase using non-normalized titles - https://phabricator.wikimedia.org/T169930 [20:11:14] T170225: OnThisDay endpoint skips AD years - https://phabricator.wikimedia.org/T170225 [20:13:29] 10Operations, 10ops-eqiad: Decommisson and store old row D network gear. - https://phabricator.wikimedia.org/T170474#3432587 (10Cmjohnson) [20:13:52] 10Operations, 10netops: cr1-eqiad:ae4 is disabled due to VRRP issue - https://phabricator.wikimedia.org/T149226#3432604 (10Cmjohnson) [20:13:55] (03PS1) 10Ottomata: Move statistics::wmde jobs to stat1005 from stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/364841 (https://phabricator.wikimedia.org/T170471) [20:13:56] 10Operations, 10ops-eqiad, 10netops, 10Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#2762011 (10Cmjohnson) 05Open>03Resolved Resolving this task, I created a subtask for the decom portion. [20:14:17] (03CR) 10Ottomata: [V: 032 C: 032] Move statistics::wmde jobs to stat1005 from stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/364841 (https://phabricator.wikimedia.org/T170471) (owner: 10Ottomata) [20:14:40] 10Operations, 10Cassandra, 10RESTBase, 10RESTBase-Cassandra, and 2 others: Option: Consider switching back to leveled compaction (LCS) - https://phabricator.wikimedia.org/T153703#3432609 (10GWicke) a:03Eevans [20:16:00] !log bsitzmann@tin Finished deploy [mobileapps/deploy@3f90bf1]: Update mobileapps to d30dae2 (T169930, T170225) (duration: 05m 00s) [20:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:14] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, and 2 others: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3432620 (10Eevans) >>! In T166181#3432545, @RobH wrote: > The rsycn should technically be easier, so I'd like to try it out first. Can you detail what... [20:19:38] 10Operations, 10ops-eqiad, 10netops: eqiad: rack frack refresh equipment - https://phabricator.wikimedia.org/T169644#3432650 (10Cmjohnson) @ayounsi I will start racking the gear this week according to the google doc. Much of what I can remove are cable managers. 
[20:20:04] (03PS1) 10RobH: setting prod dns for labpuppetmaster100[12].wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/364844 [20:20:18] RECOVERY - MariaDB Slave Lag: s2 on db1047 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [20:21:04] (03CR) 10RobH: [C: 032] setting prod dns for labpuppetmaster100[12].wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/364844 (owner: 10RobH) [20:22:04] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labpuppetmaster100[12].wikimedia.org - https://phabricator.wikimedia.org/T167905#3432656 (10RobH) [20:24:18] (03PS5) 10D3r1ck01: Remove 'din' from wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362876 (https://phabricator.wikimedia.org/T168523) [20:31:14] (03PS1) 10RobH: setting labpuppetmaster100[12] install params [puppet] - 10https://gerrit.wikimedia.org/r/364847 [20:31:57] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labpuppetmaster100[12].wikimedia.org - https://phabricator.wikimedia.org/T167905#3432694 (10RobH) [20:32:41] (03CR) 10RobH: [C: 032] setting labpuppetmaster100[12] install params [puppet] - 10https://gerrit.wikimedia.org/r/364847 (owner: 10RobH) [20:37:21] (03PS1) 10Chad: Adding MinervaNeue to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364848 [20:41:23] (03CR) 10Chad: [C: 032] Adding MinervaNeue to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364848 (owner: 10Chad) [20:42:41] (03Merged) 10jenkins-bot: Adding MinervaNeue to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364848 (owner: 10Chad) [20:42:54] (03CR) 10jenkins-bot: Adding MinervaNeue to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364848 (owner: 10Chad) [20:43:20] !log demon@tin Started scap: Rebuilding l10n cache for new skin [20:43:29] jdlrobson: Ok, started ^ [20:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:38] Just gotta wait for it + beta to finish the same [20:43:41] Then we can enable in beta [20:49:55] RainbowSprinkles: sweet [20:50:14] (03PS1) 10Chad: Enable MinervaNeue in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364856 [20:50:16] There's the next step ^ [20:50:18] Once scap is done [20:51:46] (03Abandoned) 10EBernhardson: Install ::statistics::packages to stat1004 [puppet] - 10https://gerrit.wikimedia.org/r/348669 (https://phabricator.wikimedia.org/T163177) (owner: 10EBernhardson) [20:51:51] Probably typos abound, I didn't look close [20:53:44] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labnodepool1002.eqiad.wmnet - https://phabricator.wikimedia.org/T168407#3432797 (10RobH) update from irc chat with @chasemp: please install these hosts with jessie. [20:53:49] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labweb100[12].wikimedia.org - https://phabricator.wikimedia.org/T167820#3432798 (10RobH) update from irc chat with @chasemp: please install these hosts with jessie. [20:53:52] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labnet100[34] - https://phabricator.wikimedia.org/T165779#3432799 (10RobH) update from irc chat with @chasemp: please install these hosts with jessie. 
[20:53:55] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labnet100[34] - https://phabricator.wikimedia.org/T165779#3432800 (10RobH) update from irc chat with @chasemp: please install these hosts with jessie. [20:54:58] (03CR) 10Jdlrobson: [C: 04-1] Enable MinervaNeue in labs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364856 (owner: 10Chad) [20:55:02] RainbowSprinkles: yup 1 typo [20:55:18] also needs to happen after MobileFrontend [20:55:32] (03CR) 10Jdlrobson: [C: 04-1] Enable MinervaNeue in labs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364856 (owner: 10Chad) [20:55:38] RainbowSprinkles: want me to amend? [20:56:04] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labweb100[12].wikimedia.org - https://phabricator.wikimedia.org/T167820#3432804 (10RobH) [20:56:18] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labpuppetmaster100[12].wikimedia.org - https://phabricator.wikimedia.org/T167905#3432805 (10RobH) [20:56:30] (03PS2) 10Jdlrobson: Enable MinervaNeue in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364856 (owner: 10Chad) [20:56:32] RainbowSprinkles: did it anyway :) [20:56:34] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labnet100[34] - https://phabricator.wikimedia.org/T165779#3432806 (10RobH) [20:56:40] (03CR) 10Jdlrobson: [C: 031] Enable MinervaNeue in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364856 (owner: 10Chad) [20:56:47] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labnodepool1002.eqiad.wmnet - https://phabricator.wikimedia.org/T168407#3432807 (10RobH) [20:56:51] (03CR) 10Jdlrobson: [C: 031] Enable MinervaNeue in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364856 (owner: 10Chad) [21:03:42] (03CR) 10Chad: [C: 032] Enable MinervaNeue in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364856 (owner: 10Chad) [21:05:01] (03PS1) 10Ottomata: Set up separate EventLogging Analytics MySQL Consumer process for eventbus events [puppet] - 10https://gerrit.wikimedia.org/r/364881 [21:05:17] (03Merged) 10jenkins-bot: Enable MinervaNeue in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364856 (owner: 10Chad) [21:06:42] (03CR) 10jerkins-bot: [V: 04-1] Set up separate EventLogging Analytics MySQL Consumer process for eventbus events [puppet] - 10https://gerrit.wikimedia.org/r/364881 (owner: 10Ottomata) [21:07:24] (03PS2) 10Ottomata: Set up separate EventLogging Analytics MySQL Consumer process for eventbus events [puppet] - 10https://gerrit.wikimedia.org/r/364881 [21:07:31] (03CR) 10jenkins-bot: Enable MinervaNeue in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364856 (owner: 10Chad) [21:10:58] PROBLEM - HHVM rendering on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:10:58] PROBLEM - Host pc2006 is DOWN: PING CRITICAL - Packet loss = 100% [21:11:58] RECOVERY - HHVM rendering on mw1201 is OK: HTTP OK: HTTP/1.1 200 OK - 74381 bytes in 6.004 second response time [21:13:33] (03CR) 10Ottomata: "Doing!
https://puppet-compiler.wmflabs.org/7047/eventlog1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/364881 (owner: 10Ottomata) [21:13:35] (03CR) 10Ottomata: [C: 032] Set up separate EventLogging Analytics MySQL Consumer process for eventbus events [puppet] - 10https://gerrit.wikimedia.org/r/364881 (owner: 10Ottomata) [21:15:35] 10Operations, 10Release-Engineering-Team, 10Epic, 10Services (watching): FY2017/18 Program 6 - Outcome 2: Developers are able to develop and test their applications through a unified pipeline towards production deployment. - https://phabricator.wikimedia.org/T170480#3432874 (10greg) [21:15:55] 10Operations, 10Release-Engineering-Team, 10Epic, 10Services (watching): FY2017/18 Program 6 - Outcome 2: Developers are able to develop and test their applications through a unified pipeline towards production deployment. - https://phabricator.wikimedia.org/T170480#3432874 (10greg) [21:15:59] 10Operations, 10Release-Engineering-Team, 10Epic, 10Services (watching): FY2017/18 Program 6: Streamlined Service delivery - https://phabricator.wikimedia.org/T170453#3431889 (10greg) [21:16:03] 10Operations, 10MediaWiki-Containers, 10Release-Engineering-Team, 10Epic, and 3 others: Streamlined Service Delivery: Outcome 2, Objective 3: Integrated, container-based development environment - https://phabricator.wikimedia.org/T170456#3432891 (10greg) [21:17:11] 10Operations, 10MediaWiki-Containers, 10Release-Engineering-Team, 10Epic, and 3 others: FY2017/18 Program 6 - Outcome 2 - Objective 3: Integrated, container-based development environment - https://phabricator.wikimedia.org/T170456#3431976 (10greg) [21:18:38] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/mysql-eventbus [21:18:46] ^ is me [21:18:47] on it. [21:18:57] (03PS1) 10Ottomata: Use proper topics variable for new eventlogging mysql eventbus consumer [puppet] - 10https://gerrit.wikimedia.org/r/364894 [21:18:59] (03CR) 10Ottomata: [V: 032 C: 032] Use proper topics variable for new eventlogging mysql eventbus consumer [puppet] - 10https://gerrit.wikimedia.org/r/364894 (owner: 10Ottomata) [21:19:08] 10Operations, 10Release-Engineering-Team, 10Epic, 10Services (watching): FY2017/18 Program 6 - Outcome 2 - Objective 2: Set up a continuous integration and deployment pipeline - https://phabricator.wikimedia.org/T170481#3432906 (10greg) [21:21:48] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK: OK: All defined EventLogging jobs are runnning. [21:22:08] !log demon@tin Finished scap: Rebuilding l10n cache for new skin (duration: 38m 47s) [21:22:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:28] RECOVERY - Work requests waiting in Zuul Gearman server https://grafana.wikimedia.org/dashboard/db/zuul-gearman on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] [21:24:38] (03CR) 10Dzahn: "yes, ack, it needs puppet code changes too, i will look at adding them" [puppet] - 10https://gerrit.wikimedia.org/r/364504 (owner: 10Dzahn) [21:24:56] RoanKattouw: I suspect that the problem is related to this, but I am not sure [21:25:02] if it was, you should start seeing your events again [21:25:36] we still have some things to do to make this work better (this is a couple of things that changed recently that are new), but it shouldn't bother the usual eventlogging events anymore [21:25:53] RainbowSprinkles: so far so good [21:26:10] RainbowSprinkles: did i18n finish?
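To close the loop on the EventLogging blip above: the new mysql-eventbus consumer only came up after the topics-variable fix, and the icinga check simply watches the defined jobs. A hedged spot-check on eventlog1001; the eventloggingctl wrapper and its subcommand are assumptions based on how the jobs are managed there:

```
# Confirm all defined EventLogging jobs, including the new
# consumer/mysql-eventbus, are running after the puppet fix:
sudo eventloggingctl status
```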
[21:26:22] Yep, full scap done [21:26:24] oops sorry wrong chat [21:26:25] That includes l10n [21:26:29] but, you got the message :) [21:27:11] jdlrobson: Should all be in beta already by now too [21:27:22] RainbowSprinkles: yup looks like it :D [21:27:32] I'm gonna do the other merges now [21:27:41] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: MinervaNeue config (duration: 00m 47s) [21:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:39] !log demon@tin Synchronized wmf-config/InitialiseSettings-labs.php: MinervaNeue config (duration: 00m 46s) [21:28:41] ottomata: OK lemme try another event then [21:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:20] Submitted some, I guess I have to wait 5 mins now [21:29:36] jdlrobson: Ok, now all config swaps are live in production too. All that's left for tomorrow is backporting your two patches to wmf.7 and 9 (which we can prep today), then the production config swap [21:29:45] !log demon@tin Synchronized wmf-config/mobile.php: MinervaNeue config (duration: 00m 46s) [21:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:07] RainbowSprinkles: awesome. [21:32:14] (03Abandoned) 10Dzahn: Revert "Revert "disable base monitoring for labtest* machines"" [puppet] - 10https://gerrit.wikimedia.org/r/364504 (owner: 10Dzahn) [21:38:32] (03PS1) 10Andrew Bogott: Nodepool: Slow down VM spawns by a lot. [puppet] - 10https://gerrit.wikimedia.org/r/364901 [21:40:08] !log mobrovac@tin Started deploy [recommendation-api/deploy@ca816ac]: (no justification provided) [21:40:08] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [21:40:19] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [21:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:39] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [21:40:58] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [21:41:39] 10Operations, 10Cloud-VPS: rack/setup/install labpuppetmaster100[12].wikimedia.org - https://phabricator.wikimedia.org/T167905#3433102 (10RobH) a:05RobH>03Andrew [21:41:50] 10Operations, 10Cloud-VPS: rack/setup/install labpuppetmaster100[12].wikimedia.org - https://phabricator.wikimedia.org/T167905#3349071 (10RobH) All set up and ready for Andrew to take over. [21:42:22] legoktm: are you around? I'm getting a weird error "/Users/jrobson/git/core/includes/services/ServiceContainer.php: No such service: Minerva.Config" [21:42:33] !log mobrovac@tin Finished deploy [recommendation-api/deploy@ca816ac]: (no justification provided) (duration: 02m 24s) [21:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:57] (03CR) 10Rush: [C: 031] "sounds good and if ok then move back to 6 seems sane" [puppet] - 10https://gerrit.wikimedia.org/r/364901 (owner: 10Andrew Bogott) [21:44:31] (03PS2) 10Andrew Bogott: Nodepool: Slow down VM spawns by a lot. 
[puppet] - 10https://gerrit.wikimedia.org/r/364901 (https://phabricator.wikimedia.org/T170492) [21:45:23] andrewbogott: Increasing Nodepool capacity from 25 to say 30 or 40 - if that were to become a wish from RelEng, is that something we could accommodate with relative ease, or would it be blocked on further allocation and planning? [21:45:49] it seems that reaching capacity and slowing down CR by 1-2 hours is becoming increasingly common. [21:46:00] Krinkle: My gut instinct is that increasing the pool size is less trouble than increasing the spawn rate [21:46:24] andrewbogott: (I'm not really sure what the spawn rate is..) [21:46:25] But I think what you're seeing today is an actual bug, like, nova getting overwhelmed and throwing away/killing nodes [21:46:46] Krinkle: some of the slowness is delay in time to dispose of and build new test nodes [21:47:03] (03CR) 10Andrew Bogott: [C: 032] Nodepool: Slow down VM spawns by a lot. [puppet] - 10https://gerrit.wikimedia.org/r/364901 (https://phabricator.wikimedia.org/T170492) (owner: 10Andrew Bogott) [21:47:07] Is this rate the interval at which it checks needs from Gerrit/Jenkins and decides to spawn VMs? [21:47:11] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:47:24] or is it the minimum time between spawning nodes? [21:47:39] Krinkle: I'm not 100% sure. I think it's just that right now, if it sees that it needs 10 new nodes, it has a delay between launches [21:47:44] (03CR) 10Dzahn: "yes, it should also have a reverse record. let me add it" [dns] - 10https://gerrit.wikimedia.org/r/364798 (owner: 10Andrew Bogott) [21:47:44] yes, right, the second thing :) [21:47:46] ok [21:48:00] My current thought is that that setting is causing the most trouble and also getting us the least gain [21:48:13] https://phabricator.wikimedia.org/T170492 <- part of today's story [21:48:21] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:48:41] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:48:52] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:49:36] nvm worked it out [21:49:56] andrewbogott: Aye, yeah, I wasn't talking about today actually (haven't seen a CI backlog today) [21:52:25] Krinkle: oh, huh. 
There definitely was one :) [21:52:36] I've been in meetings all day [21:53:48] oh man, huge backlog [21:54:11] !log restarting nodepool to pick up a config change [21:54:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:21] (03PS1) 10Dzahn: fix IPv6 reverse record for kraz.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/364904 [21:58:31] (03PS1) 10RobH: labnodepool1002 install_server updates [puppet] - 10https://gerrit.wikimedia.org/r/364905 [21:59:38] (03PS2) 10RobH: labnodepool1002 install_server updates [puppet] - 10https://gerrit.wikimedia.org/r/364905 [22:00:02] (03CR) 10RobH: [C: 032] labnodepool1002 install_server updates [puppet] - 10https://gerrit.wikimedia.org/r/364905 (owner: 10RobH) [22:04:53] CI backed up again I see :/ [22:06:50] !log mobrovac@tin Started deploy [recommendation-api/deploy@d5076c2]: (no justification provided) [22:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:21] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy [22:07:31] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy [22:07:59] _joe_: ^^^ [22:08:01] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [22:08:21] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy [22:08:21] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy [22:08:31] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy [22:08:40] !log mobrovac@tin Finished deploy [recommendation-api/deploy@d5076c2]: (no justification provided) (duration: 01m 50s) [22:08:41] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy [22:08:41] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy [22:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:11] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [22:09:21] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy [22:12:07] (03PS2) 10Dzahn: Add AAAA for labtestpuppetmaster2001 [dns] - 10https://gerrit.wikimedia.org/r/364798 (owner: 10Andrew Bogott) [22:12:09] (03CR) 10Dzahn: Add AAAA for labtestpuppetmaster2001 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/364798 (owner: 10Andrew Bogott) [22:13:09] (03CR) 10Dzahn: "see inline comment on PS1 for what i changed/added" [dns] - 10https://gerrit.wikimedia.org/r/364798 (owner: 10Andrew Bogott) [22:16:26] (03CR) 10Dzahn: "it's in the wrong file." [dns] - 10https://gerrit.wikimedia.org/r/364904 (owner: 10Dzahn) [22:23:29] 10Operations, 10Release-Engineering-Team, 10Epic, 10Services (watching): FY2017/18 Program 6: Streamlined Service delivery - https://phabricator.wikimedia.org/T170453#3433422 (10GWicke) [22:24:18] 10Operations, 10Epic, 10Goal, 10Services (later): End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3433431 (10GWicke) [22:25:07] gwicke: do those subtasks for T170453 make sense as direct subtasks of the main annual plan goal or do they better fit as subtasks of one of the outcome/objective pairs? 
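An aside on the Nodepool change merged above (r364901, "Nodepool: Slow down VM spawns by a lot."): as andrewbogott explains, when Nodepool sees a node deficit it launches replacement VMs with a delay between individual launches, so raising that delay eases the load on nova at the cost of refilling the pool more slowly. The Python sketch below only illustrates that trade-off; the function names are invented, and the 6-second figure is inferred from Rush's "move back to 6 seems sane" review comment, not taken from the actual Nodepool config.

```
import time

# Illustrative sketch of the "delay between launches" behaviour discussed
# above -- not Nodepool's actual code. spawn_vm() stands in for the real
# cloud API call.

def spawn_vm(i):
    print(f"spawning node {i}")

def launch_nodes(needed, spawn_delay=6.0):
    """Launch `needed` VMs, sleeping between launches so a burst of
    requests does not overwhelm the OpenStack API (nova)."""
    for i in range(needed):
        spawn_vm(i)
        if i < needed - 1:
            time.sleep(spawn_delay)  # raising this slows spawns "by a lot"

# A deficit of 10 nodes with a 6-second delay costs ~54s of deliberate
# waiting on top of boot time; a tiny delay is used here so the sketch
# finishes quickly.
launch_nodes(10, spawn_delay=0.1)
```

This is also why increasing the pool size can be the cheaper lever: a larger pool absorbs demand spikes without touching the delay that protects nova from bursts.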
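Related context for the two [dns] changes above (the kraz.wikimedia.org reverse record, and the "it's in the wrong file." review note on r364904): an IPv6 PTR record lives under a nibble-reversed name in the ip6.arpa tree, and it must be added to the zone file that owns the address's prefix, which makes it easy to file in the wrong place. A minimal sketch using Python's stdlib ipaddress module; the address and prefix below are made-up examples, not kraz's actual allocation.

```
import ipaddress

# Made-up example address -- kraz's real AAAA is not shown in the log.
addr = ipaddress.ip_address("2620:0:861:2:208:80:153:44")

# The PTR record's owner name: every hex nibble reversed, under ip6.arpa.
print(addr.reverse_pointer)
# -> 4.4.0.0.3.5.1.0.0.8.0.0.8.0.2.0.2.0.0.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa

# The record belongs in the zone file covering the enclosing prefix; a PTR
# placed under some other prefix's zone is exactly the "wrong file" problem.
zone = ipaddress.ip_network("2620:0:861::/48")
print(addr in zone)  # True -> file the PTR in this prefix's reverse zone
```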
[22:25:07] T170453: FY2017/18 Program 6: Streamlined Service delivery - https://phabricator.wikimedia.org/T170453 [22:26:40] greg-g: good question [22:26:52] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:26:53] gwicke: I'll let you stew :) [22:27:07] the requirements touches on several outcomes [22:27:13] *touch [22:28:08] I was also wondering about quarterly goals vs. outcomes / objectives [22:28:21] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:28:31] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:28:51] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 3.670 second response time [22:28:58] currently, one of the sub-tasks is the ops Q1 goal, which doesn't directly correspond to one of the objectives [22:29:16] gwicke: yeah, our quarterly goals are usually in service to an objective, not an objective itself [22:30:15] pragmatically, we can always re-parent fairly easily [22:30:39] there was a discussion about providing high level entrypoints for people to follow along [22:30:59] which might affect how we want to present / structure the high level program task [22:31:09] yeah, nothing's permanent in the hierarchy in phab, so we can adjust later [22:31:43] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:32:26] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [22:37:36] 10Operations, 10ArchCom-RfC, 10Commons, 10MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#3433502 (10GWicke) [22:41:51] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:44:51] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.006 second response time [22:45:31] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [22:49:21] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [22:58:35] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labnodepool1002.eqiad.wmnet - https://phabricator.wikimedia.org/T168407#3433685 (10RobH) [22:58:51] PROBLEM - HHVM jobrunner on mw1300 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170712T2300). Please do the needful. [23:00:04] RoanKattouw, Smalyshev, and jan_drewniak: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[23:00:18] So not it [23:00:19] here [23:00:22] Hmm hold on [23:00:27] My patch may need to not be deployed [23:00:31] * RoanKattouw checks with Trevor B [23:00:36] o/ [23:00:51] RECOVERY - HHVM jobrunner on mw1300 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.001 second response time [23:01:02] I can SWAT [23:02:04] (03PS3) 10Thcipriani: Index deletes everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363669 (https://phabricator.wikimedia.org/T163235) (owner: 10Smalyshev) [23:02:07] Yeah, cancel mine [23:02:10] I'll remove it from the wiki page [23:02:17] thanks [23:02:27] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363669 (https://phabricator.wikimedia.org/T163235) (owner: 10Smalyshev) [23:03:19] (03Merged) 10jenkins-bot: Index deletes everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363669 (https://phabricator.wikimedia.org/T163235) (owner: 10Smalyshev) [23:03:31] (03CR) 10jenkins-bot: Index deletes everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363669 (https://phabricator.wikimedia.org/T163235) (owner: 10Smalyshev) [23:04:35] SMalyshev: pulled your change on mwdebug1002 if there's anything to check there [23:06:12] thcipriani: I can't really check it as it requires admin access and I already checked it on wikis where it was deployed before [23:06:30] this is just extending the list of wikis on which it is enabled, and I don't have admin on those [23:06:41] SMalyshev: gotcha, will sync live [23:07:41] SMalyshev: So tldr of that feature is "index titles that have previously existed, in case someone wants to search by them (eg: undelete?)" -- cool! [23:08:31] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:363669|Index deletes everywhere]] T163235 (duration: 00m 47s) [23:08:37] ^ SMalyshev live everywhere [23:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:44] T163235: Archive search deployment plan - https://phabricator.wikimedia.org/T163235 [23:08:50] thcipriani: great, thanks [23:09:31] PROBLEM - MegaRAID on db2019 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [23:09:33] ACKNOWLEDGEMENT - MegaRAID on db2019 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T170503 [23:09:36] 10Operations, 10ops-codfw: Degraded RAID on db2019 - https://phabricator.wikimedia.org/T170503#3433715 (10ops-monitoring-bot) [23:09:56] 10Operations, 10Cloud-VPS: rack/setup/install labnodepool1002.eqiad.wmnet - https://phabricator.wikimedia.org/T168407#3433719 (10RobH) a:05RobH>03chasemp [23:10:08] 10Operations, 10Cloud-VPS: rack/setup/install labnodepool1002.eqiad.wmnet - https://phabricator.wikimedia.org/T168407#3363615 (10RobH) [23:14:09] RECOVERY - MariaDB Slave Lag: s2 on dbstore1001 is OK: OK slave_sql_lag Replication lag: 89945.29 seconds [23:18:09] (03PS3) 10Dzahn: Add maiwikimedia to Apache conf [puppet] - 10https://gerrit.wikimedia.org/r/361296 (https://phabricator.wikimedia.org/T168782) (owner: 10Urbanecm) [23:24:25] jan_drewniak: your change is live on mwdebug1002, check please [23:24:41] (03PS1) 10Dzahn: decom mw1196 [puppet] - 10https://gerrit.wikimedia.org/r/364918 (https://phabricator.wikimedia.org/T170441) [23:25:19] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 21 probes of 292 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [23:26:28] thcipriani: yup looks fine, good to sync [23:26:36] jan_drewniak: ok, 
going live [23:28:44] !log thcipriani@tin Synchronized php-1.30.0-wmf.9/extensions/CirrusSearch/resources/ext.cirrus.explore-similar.js: SWAT: [[gerrit:364834|Adding full URLs to Explore Similar API calls]] T149809 T164856 (duration: 00m 47s) [23:28:51] ^ jan_drewniak live everywhere [23:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:56] T149809: [A/B Test] Add 'explore similar' pages, categories and suggested languages in search results - https://phabricator.wikimedia.org/T149809 [23:28:56] T164856: A/B Test: explore similar - verify data coming in is good - https://phabricator.wikimedia.org/T164856 [23:29:13] thcipriani: thanks! [23:30:19] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 15 probes of 292 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [23:33:02] (03PS2) 10Dzahn: decom mw1196 [puppet] - 10https://gerrit.wikimedia.org/r/364918 (https://phabricator.wikimedia.org/T170441) [23:35:06] (03CR) 10Dzahn: [C: 032] Add maiwikimedia to Apache conf [puppet] - 10https://gerrit.wikimedia.org/r/361296 (https://phabricator.wikimedia.org/T168782) (owner: 10Urbanecm) [23:36:19] PROBLEM - HHVM jobrunner on mw1306 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [23:37:19] RECOVERY - HHVM jobrunner on mw1306 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time [23:37:36] man, why does it always have to scare me with one of those after touching apache config [23:40:49] 10Operations, 10ops-eqiad, 10hardware-requests: Decommission mw1196 - https://phabricator.wikimedia.org/T170441#3433767 (10RobH) [23:41:55] 10Operations, 10ops-eqiad, 10hardware-requests: Decommission mw1196 - https://phabricator.wikimedia.org/T170441#3431403 (10RobH) In the future, before starting any actual decommission steps, the checklist MUST be populated in the task description. This ensures we follow the proper procedure (which has been... [23:42:32] (03Abandoned) 10Dzahn: decom mw1196 [puppet] - 10https://gerrit.wikimedia.org/r/364918 (https://phabricator.wikimedia.org/T170441) (owner: 10Dzahn) [23:44:49] 10Operations, 10ops-eqiad, 10hardware-requests: Decommission mw1196 - https://phabricator.wikimedia.org/T170441#3433775 (10RobH) [23:45:48] 10Operations, 10ops-eqiad, 10hardware-requests: Decommission mw1196 - https://phabricator.wikimedia.org/T170441#3431403 (10RobH) a:03RobH
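To close the loop on the SWAT verification pattern seen above ("pulled your change on mwdebug1002 if there's anything to check there" and "your change is live on mwdebug1002, check please"): a staged change can be exercised against the debug backend before it is synced everywhere. The sketch below assumes the X-Wikimedia-Debug request header is what routes a request to a specific debug backend; the exact header value format used here is an assumption and is not shown in the log.

```
import urllib.request

# Hedged sketch: route one request to the mwdebug1002 backend to spot-check
# a staged change before "sync live". The header value format below is an
# assumption, not something confirmed by the log above.
req = urllib.request.Request(
    "https://en.wikipedia.org/wiki/Special:Version",
    headers={"X-Wikimedia-Debug": "backend=mwdebug1002.eqiad.wmnet"},
)
with urllib.request.urlopen(req) as resp:
    # A 200 here only proves the debug backend served the page; the
    # deployer still has to check the specific behaviour the patch changes.
    print(resp.status, resp.headers.get("Server"))
```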