[00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170713T0000).
[00:04:31] what goodies is the phabricator phaerie bringing this week?
[00:04:50] last week had fancy new project icons
[00:22:04] 10Operations, 10ops-codfw: Degraded RAID on db2019 - https://phabricator.wikimedia.org/T170503#3433715 (10RobH) This is an old R510 that is out of warranty. We should check with our #DBA team and see if this particular system is scheduled for decom anytime soon.
[00:26:28] (03PS3) 10Dzahn: Add AAAA for labtestpuppetmaster2001 [dns] - 10https://gerrit.wikimedia.org/r/364798 (owner: 10Andrew Bogott)
[00:26:57] RECOVERY - salt-minion processes on labtestpuppetmaster2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:28:28] (03PS4) 10Dzahn: Add AAAA for labtestpuppetmaster2001 [dns] - 10https://gerrit.wikimedia.org/r/364798 (owner: 10Andrew Bogott)
[00:29:53] (03CR) 10Dzahn: [C: 032] Add AAAA for labtestpuppetmaster2001 [dns] - 10https://gerrit.wikimedia.org/r/364798 (owner: 10Andrew Bogott)
[00:30:07] PROBLEM - salt-minion processes on labtestpuppetmaster2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:35:25] (03CR) 10Dzahn: "[labtestpuppetmaster2001:~] $ host labtestpuppetmaster2001.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/364798 (owner: 10Andrew Bogott)
[00:56:24] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3433897 (10Dzahn) It has IPv6 and reverse now.
[01:14:57] 10Operations, 10DC-Ops: Information missing from racktables - https://phabricator.wikimedia.org/T150651#3433938 (10faidon) a:03RobH The updated list of devices missing model/number can be found below. A bunch of them are the new cp40xx. A few more are also online, which means that we can locate it from [[ h...
[01:52:27] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 30 probes of 299 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[01:53:07] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 22 probes of 292 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[01:57:07] RECOVERY - salt-minion processes on labtestpuppetmaster2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[01:57:29] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 12 probes of 299 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[01:58:08] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 15 probes of 292 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[02:00:17] PROBLEM - salt-minion processes on labtestpuppetmaster2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[02:18:31] 10Operations: unaccepted salt keys - https://phabricator.wikimedia.org/T170510#3433974 (10Dzahn)
[02:18:34] 10Operations: unaccepted salt keys - https://phabricator.wikimedia.org/T170510#3433986 (10Dzahn) p:05Triage>03Low
[02:24:57] RECOVERY - salt-minion processes on labtestpuppetmaster2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[02:26:22] !log labtestpuppetmaster2001 - flapping icinga alerts about salt-minion starting and stopping constantly - there is an accepted salt-key but it was rejected by the master, server was reinstalled but still old key - deleted old key, accepted new key (T167157)
[02:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:26:40] T167157: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157
[02:28:56] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2019 - https://phabricator.wikimedia.org/T170503#3433998 (10Peachey88)
[02:30:13] 10Operations, 10ops-codfw, 10Cloud-VPS, 10Patch-For-Review: rack/setup/install labtestpuppetmaster2001 - https://phabricator.wikimedia.org/T167157#3434011 (10Dzahn) root@labtestpuppetmaster2001:~# ip a s | grep inet6 ... inet6 2620:0:860:4:208:80:153:108/64 scope global mngtmpaddr dynamic --- host l...
[02:34:51] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.7) (duration: 09m 54s)
[02:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:02:27] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.9) (duration: 07m 57s)
[03:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:09:35] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Jul 13 03:09:35 UTC 2017 (duration 7m 8s)
[03:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:47:32] (03CR) 10KartikMistry: [C: 031] [WIP] Make compact language links default for all Wikipedias except en and de [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364428 (owner: 10Amire80)
[04:42:45] (03PS7) 10Krinkle: mediawiki: Remove broken wikidata.org/ontology Apache alias [puppet] - 10https://gerrit.wikimedia.org/r/361801 (https://phabricator.wikimedia.org/T169023)
[04:42:52] (03CR) 10Krinkle: "Scheduled for Puppet SWAT tomorrow." [puppet] - 10https://gerrit.wikimedia.org/r/361801 (https://phabricator.wikimedia.org/T169023) (owner: 10Krinkle)
[04:45:33] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2019 - https://phabricator.wikimedia.org/T170503#3434091 (10Marostegui) Hi Rob, It is scheduled for decom but I don't know if this will happen any time soon. It is s4 master, so if we happen to have some old disks from other hosts around the DC it would b...
[05:26:37] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0
[05:27:19] <_joe_> mobrovac: \o/
[05:37:42] ACKNOWLEDGEMENT - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: cp3009_v4, cp3009_v6 Ema https://phabricator.wikimedia.org/T148422
[05:37:42] ACKNOWLEDGEMENT - IPsec on cp2012 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: cp3009_v4, cp3009_v6 Ema https://phabricator.wikimedia.org/T148422
[05:37:42] ACKNOWLEDGEMENT - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: cp3009_v4, cp3009_v6 Ema https://phabricator.wikimedia.org/T148422
[05:37:42] ACKNOWLEDGEMENT - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: cp3009_v4, cp3009_v6 Ema https://phabricator.wikimedia.org/T148422
[05:38:08] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[05:40:07] RECOVERY - Check systemd state on bast3002 is OK: OK - running: The system is fully operational
[05:40:07] ACKNOWLEDGEMENT - Check systemd state on mw1260 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Giuseppe Lavagetto reimaging under way
[05:40:07] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1260 is CRITICAL: Host mw1260 is not in mediawiki-installation dsh group Giuseppe Lavagetto reimaging under way
[05:40:07] ACKNOWLEDGEMENT - puppet last run on mw1260 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 25 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[jobchron],Service[jobrunner] Giuseppe Lavagetto reimaging under way
[05:45:21] 10Operations, 10monitoring, 10netops: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144#3434183 (10ayounsi) Test instance upgraded to the latest master.
[05:47:17] RECOVERY - Check for valid instance states on labnodepool1001 is OK: nodepool state management is OK
[05:52:57] (03PS2) 10Muehlenhoff: Remove sshd options specific to SSH protocol 1 [puppet] - 10https://gerrit.wikimedia.org/r/364682 (https://phabricator.wikimedia.org/T170298)
[06:11:07] RECOVERY - salt-minion processes on stat1006 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:11:37] !log fixed salt setup for reimaged stat1006
[06:11:42] (03CR) 10Ema: varnish: Avoid std.fileread() and use new errorpage template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle)
[06:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:52] !log restricting ssh algorithms on network devices - T170369
[06:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:03] T170369: Remove unsecure SSH algorithms on network devices - https://phabricator.wikimedia.org/T170369
[06:27:59] (03PS1) 10Giuseppe Lavagetto: recommendation-api: check robots.txt [puppet] - 10https://gerrit.wikimedia.org/r/364944 (https://phabricator.wikimedia.org/T165760)
[06:31:35] (03CR) 10Giuseppe Lavagetto: [C: 032] recommendation-api: check robots.txt [puppet] - 10https://gerrit.wikimedia.org/r/364944 (https://phabricator.wikimedia.org/T165760) (owner: 10Giuseppe Lavagetto)
[06:33:24] (03PS1) 10Muehlenhoff: Enable base::firewall for labtestpuppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/364945
[06:40:21] (03PS1) 10Muehlenhoff: Enable base::firewall for labtestpuppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/364946
[06:42:20] <_joe_> !log rolling restart of pybal on low-traffic balancers
[06:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:06] !log Manually deploy some alter tables on dbstore1001 for enwiki - T166204
[06:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:16] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204
[06:51:35] 10Operations, 10netops: Remove unsecure SSH algorithms on network devices - https://phabricator.wikimedia.org/T170369#3434238 (10ayounsi) 05Open>03Resolved Pushed to all network devices. We can revisit the automated scans at a later date (or a common network/systems goal).
[06:51:46] !log installing nginx security updates on cp4*
[06:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:06] !log installing apache security updates on remaining mw1* hosts
[06:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:05] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0]
[07:15:14] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0]
[07:16:46] (03PS1) 10Ema: DnsQuery monitor: use IPv4 [debs/pybal] - 10https://gerrit.wikimedia.org/r/364948 (https://phabricator.wikimedia.org/T82747)
[07:21:09] 10Operations, 10User-fgiunchedi: Upgrade grafana to 4.4.1 - https://phabricator.wikimedia.org/T169773#3408066 (10MoritzMuehlenhoff) This still needs to be imported to apt.wikimedia.org, right now "apt-get upgrade" on labmon1001 proposes to downgrade grafana back to 4.3.2
[07:24:35] 10Operations, 10Traffic: Non zero rated LVS IPs - https://phabricator.wikimedia.org/T170518#3434273 (10ayounsi)
[07:27:04] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1643.60 Read Requests/Sec=586.30 Write Requests/Sec=1.20 KBytes Read/Sec=55058.00 KBytes_Written/Sec=30.40
[07:31:33] (03PS2) 10Ema: DnsQuery monitor: query over IPv4 [debs/pybal] - 10https://gerrit.wikimedia.org/r/364948 (https://phabricator.wikimedia.org/T82747)
[07:32:43] (03PS1) 10Elukey: eventlogging_purging_whitelist.tsv: remove unnecessary schemas [puppet] - 10https://gerrit.wikimedia.org/r/364949 (https://phabricator.wikimedia.org/T156933)
[07:33:36] (03CR) 10Elukey: [V: 032 C: 032] eventlogging_purging_whitelist.tsv: remove unnecessary schemas [puppet] - 10https://gerrit.wikimedia.org/r/364949 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey)
[07:33:38] !log rebooting netmon1001
[07:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:25] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=80.70 Read Requests/Sec=314.60 Write Requests/Sec=0.60 KBytes Read/Sec=1258.80 KBytes_Written/Sec=18.80
[07:37:00] (03PS3) 10Ema: DnsQuery monitor: query over IPv4 [debs/pybal] - 10https://gerrit.wikimedia.org/r/364948 (https://phabricator.wikimedia.org/T82747)
[07:39:33] <_joe_> /win 25
[07:43:33] (03PS1) 10Marostegui: db-eqiad.php: Depool db1053 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364950 (https://phabricator.wikimedia.org/T168661)
[07:45:42] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1053 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364950 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui)
[07:47:05] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1053 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364950 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui)
[07:47:15] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1053 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364950 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui)
[07:47:43] 10Operations, 10Wikimedia-Site-requests, 10Patch-For-Review, 10User-MarcoAurelio, 10User-Urbanecm: Logo for sr.wikiquote.org - https://phabricator.wikimedia.org/T168444#3434299 (10Dereckson) Even with a purge, the https://en.wikipedia.org/static/images/project-logos/srwikiquote.png hasn't been purged fro...
[07:47:50] (03CR) 10Alexandros Kosiaris: [C: 031] icinga/role:mail::mx: add monitoring of exim queue size [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) (owner: 10Dzahn)
[07:48:13] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1053 - T168661 (duration: 00m 47s)
[07:48:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:22] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[07:51:29] 10Operations, 10ops-eqiad, 10Services (watching): restbase-dev1003 stuck after reboot - https://phabricator.wikimedia.org/T169696#3434308 (10MoritzMuehlenhoff) Thanks, confirmed working fine. Due to the power off it picked up the new kernel, so another reboot won't be necessary.
[07:54:51] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2019 - https://phabricator.wikimedia.org/T170503#3434309 (10Papaul) @Marostegui this was not assigned to me so i missed it on my dashboard sorry about that. but will look into this first thing in the am once on site. We should have some disks from the dec...
[07:55:47] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2019 - https://phabricator.wikimedia.org/T170503#3434311 (10Marostegui) @Papaul no worries! :-) Thanks!
[07:56:04] PROBLEM - nutcracker process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:56:05] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:56:34] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:56:55] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient
[07:56:55] RECOVERY - nutcracker process on thumbor1002 is OK: PROCS OK: 1 process with UID = 115 (nutcracker), command name nutcracker
[07:57:24] RECOVERY - Check systemd state on mw1260 is OK: OK - running: The system is fully operational
[07:57:24] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:57:43] (03CR) 10Filippo Giunchedi: [C: 04-1] "LGTM, sudo rules won't work tho" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) (owner: 10Dzahn)
[07:58:02] 10Operations, 10Traffic, 10monitoring, 10Patch-For-Review: Performance impact evaluation of enabling nginx-lua and nginx-lua-prometheus on tlsproxy - https://phabricator.wikimedia.org/T161101#3434317 (10ema) 05Open>03Resolved a:03ema >>! In T161101#3431603, @faidon wrote: > @Ema, it seems like the ta...
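[Editor's note: the "PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion" checks above match full process argument strings against an anchored regex, as check_procs does with its ereg-argument-array matching. A minimal sketch of that decision, with hypothetical command lines (not taken from these hosts):]

```python
import re

# Anchored pattern as it appears in the alerts above.
PATTERN = re.compile(r"^/usr/bin/python /usr/bin/salt-minion")

def count_matching(cmdlines):
    """Count processes whose full argument string matches the pattern."""
    return sum(1 for c in cmdlines if PATTERN.search(c))

# Hypothetical process table: one salt-minion, one unrelated daemon.
procs = [
    "/usr/bin/python /usr/bin/salt-minion -d",
    "/usr/sbin/nginx -g daemon on;",
]
n = count_matching(procs)
print("PROCS OK" if n >= 1 else "PROCS CRITICAL", n)  # PROCS OK 1
```

The anchor (`^`) is why a reinstalled host running the minion under a different interpreter path would flip the check to "0 processes" even while a minion is nominally running.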
[07:58:12] !log Deploy alter table on s4 - db1053 - T168661
[07:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:22] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[08:08:14] (03PS6) 10Amire80: Make compact language links default for all Wikipedias except en and de [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364428
[08:09:00] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: mw1260.eqiad.wmnet
[08:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:28] !log powercycle pc2006, was down
[08:11:35] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0]
[08:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:14] RECOVERY - Host pc2006 is UP: PING OK - Packet loss = 0%, RTA = 36.21 ms
[08:15:14] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic2007.codfw.wmnet because of too many down!: prometheus_80 - Could not depool server prometheus2004.codfw.wmnet because of too many down!: wdqs_80 - Could not depool server wdqs2001.codfw.wmnet because of too many down!
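[Editor's note: the "Could not depool server ... because of too many down!" alert above is PyBal's depool threshold at work: a failing backend is kept pooled if removing it would drop the pool below a configured fraction of healthy servers. A simplified sketch of that rule; the function name and the default fraction are assumptions, not PyBal's actual code:]

```python
import math

def can_depool(up_count, pool_size, depool_threshold=0.5):
    """Allow depooling one more server only if the servers left up
    still meet the configured fraction of the pool (simplified)."""
    return up_count - 1 >= math.ceil(pool_size * depool_threshold)

print(can_depool(3, 4))  # True: 2 servers would remain, 2 required
print(can_depool(2, 4))  # False: "too many down", server stays pooled
```

This is why the alert fires instead of PyBal silently shrinking the pool: keeping an unhealthy server pooled is judged less harmful than serving the whole load from too few backends.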
[08:15:27] (03CR) 10Giuseppe Lavagetto: Generalize state management, allow multiple run modes (036 comments) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363217 (owner: 10Giuseppe Lavagetto)
[08:15:46] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Icinga loses downtime entries, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3434362 (10akosiaris)
[08:16:14] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy
[08:19:14] !log upgrade grafana to 4.4.1 on krypton - T169773
[08:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:23] T169773: Upgrade grafana to 4.4.1 - https://phabricator.wikimedia.org/T169773
[08:20:04] 10Operations, 10User-fgiunchedi: Upgrade grafana to 4.4.1 - https://phabricator.wikimedia.org/T169773#3434369 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi @MoritzMuehlenhoff indeed! now upgraded
[08:21:55] RECOVERY - mediawiki-installation DSH group on mw1260 is OK: OK
[08:25:36] 10Operations, 10ops-codfw: pc2006 crashed - https://phabricator.wikimedia.org/T170520#3434378 (10jcrespo)
[08:26:17] (03PS7) 10Giuseppe Lavagetto: Generalize state management, allow multiple run modes [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363217
[08:26:19] (03PS1) 10Ema: cache::misc: enable nginx-lua-prometheus [puppet] - 10https://gerrit.wikimedia.org/r/364952
[08:30:01] 10Operations, 10ops-codfw: pc2006 crashed - https://phabricator.wikimedia.org/T170520#3434399 (10jcrespo) I think RAID is ok, but I wouldn't mind a second opinion @marostegui - was noisy just because unclean umount: ```lines=20 [ 4.993319] ata1: SATA max UDMA/133 abar m2048@0x91c01000 port 0x91c01100 irq 38...
[08:32:42] !log enabling jobrunner/jobchron on mw1260 (jessie video scaler)
[08:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:44] RECOVERY - puppet last run on mw1260 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[08:34:47] (03PS1) 10Elukey: eventlogging_cleaner.py: separate logs betweem stdout/stderr$ [puppet] - 10https://gerrit.wikimedia.org/r/364956 (https://phabricator.wikimedia.org/T156933)
[08:36:41] (03PS2) 10Elukey: eventlogging_cleaner.py: split logs between stdout/stderr$ [puppet] - 10https://gerrit.wikimedia.org/r/364956 (https://phabricator.wikimedia.org/T156933)
[08:38:04] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:39:04] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[08:39:44] (03CR) 10Filippo Giunchedi: [C: 031] cache::misc: enable nginx-lua-prometheus [puppet] - 10https://gerrit.wikimedia.org/r/364952 (owner: 10Ema)
[08:41:04] PROBLEM - MariaDB Slave Lag: s4 on db1053 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2484.24 seconds
[08:41:18] I downtimed that host
[08:41:23] when?
[08:41:30] there was no downtime today morning
[08:41:33] Not long ago
[08:41:38] Maybe 1h ago
[08:41:43] let me check the logs
[08:41:46] (03CR) 10Elukey: [C: 032] eventlogging_cleaner.py: split logs between stdout/stderr$ [puppet] - 10https://gerrit.wikimedia.org/r/364956 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey)
[08:41:54] * volans checking
[08:42:26] [1499931875] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;db1053;MariaDB Slave Lag: s4;1499931859;1500284659;1;0;7200;Marostegui;T168661
[08:42:26] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[08:42:29] [1499931875] SERVICE DOWNTIME ALERT: db1053;MariaDB Slave Lag: s4;STARTED; Service has entered a period of scheduled downtime
[08:45:06] how long was that period of downtime I wonder
[08:45:25] I can't tell from the params in the log message
[08:45:34] (03CR) 10Giuseppe Lavagetto: [C: 031] DnsQuery monitor: query over IPv4 [debs/pybal] - 10https://gerrit.wikimedia.org/r/364948 (https://phabricator.wikimedia.org/T82747) (owner: 10Ema)
[08:45:42] apergos: from 1499931859 to 1500284659
[08:45:42] which is
[08:45:57] root@einsteinium:/var/log/icinga# date -s @1499931859
[08:45:57] Thu Jul 13 07:44:19 UTC 2017
[08:46:01] root@einsteinium:/var/log/icinga# date -s @1500284659
[08:46:01] Mon Jul 17 09:44:19 UTC 2017
[08:46:08] (03CR) 10Giuseppe Lavagetto: [C: 032] Reset the waitIndex when connection is lost or failed [debs/pybal] - 10https://gerrit.wikimedia.org/r/363611 (https://phabricator.wikimedia.org/T169893) (owner: 10Giuseppe Lavagetto)
[08:46:20] (03PS1) 10Giuseppe Lavagetto: Reset the waitIndex when connection is lost or failed [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/364958 (https://phabricator.wikimedia.org/T169893)
[08:46:25] yeah it's for days
[08:46:31] I did it the long way but same result
[08:46:33] wtf
[08:48:33] (03PS4) 10Ema: DnsQuery monitor: query over IPv4 [debs/pybal] - 10https://gerrit.wikimedia.org/r/364948 (https://phabricator.wikimedia.org/T82747)
[08:48:40] (03CR) 10Ema: [V: 032 C: 032] DnsQuery monitor: query over IPv4 [debs/pybal] - 10https://gerrit.wikimedia.org/r/364948 (https://phabricator.wikimedia.org/T82747) (owner: 10Ema)
[08:49:44] 10Operations, 10ops-codfw: pc2006 crashed - https://phabricator.wikimedia.org/T170520#3434467 (10Marostegui) I agree, RAID looks good. mdadm looks good, and it didn't log anything relevant and it didn't rename the arrays or anything (which usually happens when it gets back in a weird state)
[08:52:31] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Icinga loses downtime entries, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3434481 (10Marostegui) I will log today's one, just to keep a track of them just in case they are useful when checking logs. db1053 just paged: ``...
[08:54:07] (03PS1) 10Elukey: eventlogging_cleaner.py: set default loglevel for the main logger [puppet] - 10https://gerrit.wikimedia.org/r/364961 (https://phabricator.wikimedia.org/T156933)
[08:56:51] (03PS1) 10Ema: 1.13.8: waitIndex and dnsquery fixes [debs/pybal] - 10https://gerrit.wikimedia.org/r/364962
[08:57:40] (03PS1) 10Ema: DnsQuery monitor: query over IPv4 [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/364963 (https://phabricator.wikimedia.org/T82747)
[08:58:46] (03PS1) 10Alexandros Kosiaris: icinga: Bump max_concurrent_checks by 20% [puppet] - 10https://gerrit.wikimedia.org/r/364965
[09:00:16] (03CR) 10Giuseppe Lavagetto: [C: 031] 1.13.8: waitIndex and dnsquery fixes [debs/pybal] - 10https://gerrit.wikimedia.org/r/364962 (owner: 10Ema)
[09:00:43] (03CR) 10Ema: [C: 032] 1.13.8: waitIndex and dnsquery fixes [debs/pybal] - 10https://gerrit.wikimedia.org/r/364962 (owner: 10Ema)
[09:01:14] (03CR) 10Ema: [C: 032] Reset the waitIndex when connection is lost or failed [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/364958 (https://phabricator.wikimedia.org/T169893) (owner: 10Giuseppe Lavagetto)
[09:01:36] (03PS2) 10Ema: DnsQuery monitor: query over IPv4 [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/364963 (https://phabricator.wikimedia.org/T82747)
[09:01:43] (03CR) 10Ema: [V: 032 C: 032] DnsQuery monitor: query over IPv4 [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/364963 (https://phabricator.wikimedia.org/T82747) (owner: 10Ema)
[09:02:10] (03PS1) 10Ema: 1.13.8: waitIndex and dnsquery fixes [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/364966
[09:02:29] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[09:02:59] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[09:03:54] XioNoX: ^
[09:04:28] XioNoX: in case you want to check OSPF logs, but it might be too late already
[09:04:51] (03CR) 10Elukey: [C: 032] eventlogging_cleaner.py: set default loglevel for the main logger [puppet] - 10https://gerrit.wikimedia.org/r/364961 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey)
[09:05:07] thx
[09:05:23] context: T170131
[09:05:23] T170131: Recurring varnish-be fetch failures in codfw - https://phabricator.wikimedia.org/T170131
[09:06:24] (03CR) 10Ema: [V: 032 C: 032] 1.13.8: waitIndex and dnsquery fixes [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/364966 (owner: 10Ema)
[09:11:06] ema: that one doesn't match with an bfd/ospf flap
[09:11:39] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[09:12:09] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[09:21:40] !log installing nginx security updates on cp2*
[09:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:18] RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational
[09:24:09] (03PS1) 10Ema: Reset waitIndex when connection is lost in a unclean way [debs/pybal] - 10https://gerrit.wikimedia.org/r/364971 (https://phabricator.wikimedia.org/T169893)
[09:25:42] (03PS5) 10Giuseppe Lavagetto: Add coverage report [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363350
[09:25:44] (03PS5) 10Giuseppe Lavagetto: Raise test coverage percentage [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363351
[09:25:46] (03PS8) 10Giuseppe Lavagetto: Add future parser run mode [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363808 (https://phabricator.wikimedia.org/T169546)
[09:27:12] (03CR) 10Giuseppe Lavagetto: [C: 031] "/me wears brown paper bag" [debs/pybal] - 10https://gerrit.wikimedia.org/r/364971 (https://phabricator.wikimedia.org/T169893) (owner: 10Ema)
[09:28:00] (03CR) 10Ema: [C: 032] Reset waitIndex when connection is lost in a unclean way [debs/pybal] - 10https://gerrit.wikimedia.org/r/364971 (https://phabricator.wikimedia.org/T169893) (owner: 10Ema)
[09:28:20] (03PS1) 10Ema: Reset waitIndex when connection is lost in a unclean way [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/364973 (https://phabricator.wikimedia.org/T169893)
[09:31:04] (03PS1) 10Ema: 1.13.9: Reset waitIndex when connection is lost in a unclean way [debs/pybal] - 10https://gerrit.wikimedia.org/r/364974
[09:32:36] (03CR) 10Ema: [C: 032] 1.13.9: Reset waitIndex when connection is lost in a unclean way [debs/pybal] - 10https://gerrit.wikimedia.org/r/364974 (owner: 10Ema)
[09:32:46] (03PS1) 10Ema: 1.13.9: Reset waitIndex when connection is lost in a unclean way [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/364975
[09:33:29] (03CR) 10Ema: [C: 032] Reset waitIndex when connection is lost in a unclean way [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/364973 (https://phabricator.wikimedia.org/T169893) (owner: 10Ema)
[09:34:23] (03PS2) 10Ema: 1.13.9: Reset waitIndex when connection is lost in a unclean way [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/364975
[09:34:29] (03CR) 10Ema: [V: 032 C: 032] 1.13.9: Reset waitIndex when connection is lost in a unclean way [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/364975 (owner: 10Ema)
[09:41:53] !log pybal 1.13.9 uploaded to apt.w.o
[09:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:43] PROBLEM - WDQS HTTP on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.072 second response time
[09:42:44] PROBLEM - WDQS SPARQL on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.072 second response time
[09:43:34] !log lvs1010: upgrade pybal to 1.13.9 T82747 T154759
[09:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:45] T82747: pybal health checks are ipv4 even for ipv6 vips - https://phabricator.wikimedia.org/T82747
[09:43:45] T154759: Pybal not happy with DNS delays - https://phabricator.wikimedia.org/T154759
[09:45:44] PROBLEM - Restbase root url on restbase2001 is CRITICAL: connect to address 10.192.16.152 and port 7231: Connection refused
[09:48:23] PROBLEM - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
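[Editor's note: the icinga downtime window inspected earlier (1499931859 to 1500284659) spans roughly four days, matching the pasted `date` output (Thu Jul 13 07:44:19 UTC 2017 to Mon Jul 17 09:44:19 UTC 2017). As an aside, `date -s @…` as typed in the log sets the system clock; `date -d @…` only formats an epoch. The same conversion without root or clock changes, for instance in Python:]

```python
from datetime import datetime, timezone

# Epochs from the SCHEDULE_SVC_DOWNTIME entry for db1053 above.
start, end = 1499931859, 1500284659
fmt = "%a %b %d %H:%M:%S UTC %Y"

print(datetime.fromtimestamp(start, tz=timezone.utc).strftime(fmt))  # Thu Jul 13 07:44:19 UTC 2017
print(datetime.fromtimestamp(end, tz=timezone.utc).strftime(fmt))    # Mon Jul 17 09:44:19 UTC 2017
print(f"duration: {(end - start) / 86400:.2f} days")                 # duration: 4.08 days
```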
[09:49:23] PROBLEM - WDQS HTTP on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.072 second response time
[09:52:03] PROBLEM - WDQS SPARQL on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.072 second response time
[09:55:44] (03PS1) 10Giuseppe Lavagetto: puppet-compiler: allow more than one runner [puppet] - 10https://gerrit.wikimedia.org/r/364979
[10:00:05] (03PS2) 10Giuseppe Lavagetto: puppet-compiler: allow more than one runner [puppet] - 10https://gerrit.wikimedia.org/r/364979
[10:04:56] !log lvs[12]006: upgrade pybal to 1.13.9 T82747 T154759
[10:05:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:07] T82747: pybal health checks are ipv4 even for ipv6 vips - https://phabricator.wikimedia.org/T82747
[10:05:08] T154759: Pybal not happy with DNS delays - https://phabricator.wikimedia.org/T154759
[10:09:31] !log rebooting graphite2* for kernel update
[10:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:41] 10Operations, 10Wikimedia-Site-requests, 10Patch-For-Review, 10User-MarcoAurelio, 10User-Urbanecm: Logo for sr.wikiquote.org - https://phabricator.wikimedia.org/T168444#3434717 (10MarcoAurelio) Can somebody explain to me what I did wrong so I don't commit the same mistake again? I don't understand. Thanks.
[10:18:26] (03PS3) 10Giuseppe Lavagetto: puppet-compiler: allow more than one runner [puppet] - 10https://gerrit.wikimedia.org/r/364979
[10:20:30] PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [100.0]
[10:21:05] expected ^ graphite2* have been rebooted
[10:24:40] RECOVERY - carbon-frontend-relay metric drops on graphite1001 is OK: OK: Less than 80.00% above the threshold [25.0]
[10:28:03] 10Operations, 10Wikimedia-Site-requests, 10Patch-For-Review, 10User-MarcoAurelio, 10User-Urbanecm: Logo for sr.wikiquote.org - https://phabricator.wikimedia.org/T168444#3434781 (10Urbanecm) >>! In T168444#3434717, @MarcoAurelio wrote: > Can somebody explain to me what I did wrong so I don't commit the sa...
[10:32:12] (03PS4) 10Giuseppe Lavagetto: puppet-compiler: allow more than one runner [puppet] - 10https://gerrit.wikimedia.org/r/364979
[10:35:25] !log installing nginx security updates on cp3*
[10:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:32] (03PS1) 10Urbanecm: Add enwiki as import source for specieswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364983 (https://phabricator.wikimedia.org/T170094)
[11:02:39] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet-compiler: allow more than one runner [puppet] - 10https://gerrit.wikimedia.org/r/364979 (owner: 10Giuseppe Lavagetto)
[11:14:23] PROBLEM - Host nosuchhost is DOWN: PING CRITICAL - Packet loss = 100%
[11:20:12] PROBLEM - kartotherian endpoints health on maps-test2002 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (expecting:
[11:20:12] }/{y}@{scale}x.{format} (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-pbf) is CRITICAL: Test tile service info for osm-pbf returned the unexpected status 400 (expecting
[11:20:33] PROBLEM - WDQS SPARQL on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.072 second response time
[11:21:12] PROBLEM - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 35, down: 1, dormant: 0, excluded: 0, unused: 0
[11:21:14] PROBLEM - Nginx local proxy to apache on mw1228 is CRITICAL: connect to address 10.64.48.63 and port 443: Connection refused
[11:21:53] PROBLEM - Check systemd state on labtestpuppetmaster2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:22:02] PROBLEM - kartotherian endpoints health on maps-test2001 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (expecting:
[11:22:02] }/{y}@{scale}x.{format} (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-pbf) is CRITICAL: Test tile service info for osm-pbf returned the unexpected status 400 (expecting
[11:22:02] PROBLEM - MD RAID on lvs3001 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0
[11:22:03] ACKNOWLEDGEMENT - MD RAID on lvs3001 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T170538
[11:22:05] PROBLEM - Apache HTTP on mw1228 is CRITICAL: connect to address 10.64.48.63 and port 80: Connection refused
[11:22:07] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T170538#3434841 (10ops-monitoring-bot)
[11:23:03] PROBLEM - Check size of conntrack table on mw1228 is CRITICAL: Return code of 255 is out of bounds
[11:23:04] PROBLEM - configured eth on mw1228 is CRITICAL: Return code of 255 is out of bounds
[11:23:42] PROBLEM - Check whether ferm is active by checking the default input chain on labtestpuppetmaster2001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[11:23:55] PROBLEM - dhclient process on mw1228 is CRITICAL: Return code of 255 is out of bounds
[11:23:55] PROBLEM - Check systemd state on mw1228 is CRITICAL: Return code of 255 is out of bounds
[11:23:55] PROBLEM - Check systemd state on mw1260 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[11:24:02] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[11:24:52] PROBLEM - mediawiki-installation DSH group on mw1228 is CRITICAL: Host mw1228 is not in mediawiki-installation dsh group
[11:24:53] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1228 is CRITICAL: Return code of 255 is out of bounds
[11:25:03] PROBLEM - MariaDB Slave SQL: s3 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1007, Errmsg: Error Cant create database dinwiki: database exists on query. Default database: dinwiki.
[Query snipped] [11:25:44] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [11:26:02] PROBLEM - Keyholder SSH agent on netmon2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [11:26:32] PROBLEM - kartotherian endpoints health on maps-test2004 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (expecting: [11:26:32] }/{y}@{scale}x.{format} (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-pbf) is CRITICAL: Test tile service info for osm-pbf returned the unexpected status 400 (expecting [11:27:22] ACKNOWLEDGEMENT - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 Ema https://phabricator.wikimedia.org/T148422 [11:27:22] ACKNOWLEDGEMENT - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 Ema https://phabricator.wikimedia.org/T148422 [11:27:22] ACKNOWLEDGEMENT - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 Ema https://phabricator.wikimedia.org/T148422 [11:27:22] ACKNOWLEDGEMENT - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 Ema https://phabricator.wikimedia.org/T148422 [11:27:22] ACKNOWLEDGEMENT - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 Ema 
https://phabricator.wikimedia.org/T148422 [11:27:22] ACKNOWLEDGEMENT - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 Ema https://phabricator.wikimedia.org/T148422 [11:27:42] PROBLEM - nutcracker process on mw1228 is CRITICAL: Return code of 255 is out of bounds [11:28:10] !log restart icinga, it's reporting wrong stuff all over the place [11:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:24] PROBLEM - Host nosuchhost is DOWN: PING CRITICAL - Packet loss = 100% [11:28:51] PROBLEM - kartotherian endpoints health on maps-test2003 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (expecting: [11:28:51] }/{y}@{scale}x.{format} (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-pbf) is CRITICAL: Test tile service info for osm-pbf returned the unexpected status 400 (expecting [11:29:52] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:32:59] PROBLEM - Host nosuchhost2 is DOWN: PING CRITICAL - Packet loss = 100% [11:33:25] 10Operations, 10Wikimedia-Site-requests, 10Patch-For-Review, 10User-MarcoAurelio, 10User-Urbanecm: Logo for sr.wikiquote.org - https://phabricator.wikimedia.org/T168444#3434847 (10MarcoAurelio) Except for the trick of wget the URL from commons (I downloaded the 2000px file and used an online resizer, usi... 
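The `Degraded RAID on lvs3001` alerts above are auto-acknowledged by a monitoring bot that files a Phabricator task from the check output. As an illustration only (this is not the actual ops-monitoring-bot code), a minimal Python sketch of parsing the check message such a handler consumes:

```python
import re

# Pattern matching the MD RAID check output quoted in the alerts above.
RAID_RE = re.compile(
    r"State: (?P<state>\w+), Active: (?P<active>\d+), "
    r"Working: (?P<working>\d+), Failed: (?P<failed>\d+), "
    r"Spare: (?P<spare>\d+)"
)

def parse_md_raid(message):
    """Parse an icinga MD RAID check message into a dict, or return
    None if the message does not match the expected format."""
    m = RAID_RE.search(message)
    if m is None:
        return None
    fields = m.groupdict()
    # Keep the state as a string, convert the counters to ints.
    return {k: (v if k == "state" else int(v)) for k, v in fields.items()}

alert = "CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0"
print(parse_md_raid(alert))
```

Note the duplicate-task problem visible later in the log (T170538/T170539 for the same host): a parser like this sees each restart-induced re-alert as fresh, which is why the humans end up merging tasks by hand.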
[11:34:09] PROBLEM - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 35, down: 1, dormant: 0, excluded: 0, unused: 0 [11:34:19] PROBLEM - Nginx local proxy to apache on mw1228 is CRITICAL: connect to address 10.64.48.63 and port 443: Connection refused [11:34:59] PROBLEM - Check systemd state on labtestpuppetmaster2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:34:59] PROBLEM - kartotherian endpoints health on maps-test2001 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (expecting: [11:34:59] }/{y}@{scale}x.{format} (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-pbf) is CRITICAL: Test tile service info for osm-pbf returned the unexpected status 400 (expecting [11:34:59] PROBLEM - MD RAID on lvs3001 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 [11:35:00] ACKNOWLEDGEMENT - MD RAID on lvs3001 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T170539 [11:35:03] 10Operations, 10ops-esams: Degraded RAID on lvs3001 - https://phabricator.wikimedia.org/T170539#3434848 (10ops-monitoring-bot) [11:35:08] PROBLEM - Apache HTTP on mw1228 is CRITICAL: connect to address 10.64.48.63 and port 80: Connection refused [11:35:19] PROBLEM - Check systemd state on restbase2001 is CRITICAL: 
CRITICAL - degraded: The system is operational but one or more units failed. [11:36:03] PROBLEM - Check size of conntrack table on mw1228 is CRITICAL: Return code of 255 is out of bounds [11:36:03] PROBLEM - configured eth on mw1228 is CRITICAL: Return code of 255 is out of bounds [11:36:20] PROBLEM - WDQS HTTP on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 387 bytes in 0.072 second response time [11:36:58] PROBLEM - Check systemd state on mw1228 is CRITICAL: Return code of 255 is out of bounds [11:36:59] PROBLEM - dhclient process on mw1228 is CRITICAL: Return code of 255 is out of bounds [11:36:59] PROBLEM - Check systemd state on mw1260 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:40:32] !log stop ircecho, icinga is misbehaving badly, no point in having it around [11:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:24] 10Operations, 10ops-codfw: Degraded RAID on db2019 - https://phabricator.wikimedia.org/T170541#3434877 (10ops-monitoring-bot) [11:56:02] 10Operations, 10ops-codfw: Degraded RAID on db2019 - https://phabricator.wikimedia.org/T170541#3434897 (10Marostegui) [11:56:04] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2019 - https://phabricator.wikimedia.org/T170503#3434899 (10Marostegui) [11:56:21] marostegui: there might be some duplicate coming [11:56:29] yeah, I merged that [11:56:34] and expecting db1066 maybe too :) [11:59:33] 10Operations, 10ops-eqiad: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T170544#3434914 (10ops-monitoring-bot) [12:00:15] there it is :) [12:00:17] I will close it [12:00:37] 10Operations, 10ops-eqiad: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T170544#3434921 (10Marostegui) [12:00:39] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1066 - https://phabricator.wikimedia.org/T169448#3434923 (10Marostegui) [12:01:01] (03PS2) 10Alexandros Kosiaris: icinga: Bump
max_concurrent_checks by 20% [puppet] - 10https://gerrit.wikimedia.org/r/364965 [12:01:03] (03PS1) 10Alexandros Kosiaris: icinga: Bump icinga-tmpfs size to 1024m [puppet] - 10https://gerrit.wikimedia.org/r/364991 (https://phabricator.wikimedia.org/T164206) [12:03:51] (03CR) 10Alexandros Kosiaris: [C: 032] icinga: Bump max_concurrent_checks by 20% [puppet] - 10https://gerrit.wikimedia.org/r/364965 (owner: 10Alexandros Kosiaris) [12:04:01] (03CR) 10Alexandros Kosiaris: [C: 032] icinga: Bump icinga-tmpfs size to 1024m [puppet] - 10https://gerrit.wikimedia.org/r/364991 (https://phabricator.wikimedia.org/T164206) (owner: 10Alexandros Kosiaris) [12:06:40] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1053" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364992 [12:07:22] PROBLEM - IPMI Temperature on mw1228 is CRITICAL: Return code of 255 is out of bounds [12:08:39] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1053" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364992 (owner: 10Marostegui) [12:10:14] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1053" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364992 (owner: 10Marostegui) [12:10:24] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1053" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364992 (owner: 10Marostegui) [12:12:02] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1053 - T168661 (duration: 01m 03s) [12:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:15] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [12:22:53] 10Operations, 10Traffic, 10Wikimedia-Site-requests: Optimize Wikipedia PNG Logo - https://phabricator.wikimedia.org/T170546#3434956 (10Rap2h) [12:23:16] ACKNOWLEDGEMENT - Apache HTTP on mw1228 is CRITICAL: connect to address 10.64.48.63 and port 80: Connection refused Muehlenhoff needs reimage, T168613 [12:23:16] 
ACKNOWLEDGEMENT - Check size of conntrack table on mw1228 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff needs reimage, T168613 [12:23:16] ACKNOWLEDGEMENT - Check systemd state on mw1228 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff needs reimage, T168613 [12:23:16] ACKNOWLEDGEMENT - Check the NTP synchronisation status of timesyncd on mw1228 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff needs reimage, T168613 [12:23:16] ACKNOWLEDGEMENT - Check whether ferm is active by checking the default input chain on mw1228 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff needs reimage, T168613 [12:23:17] ACKNOWLEDGEMENT - DPKG on mw1228 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff needs reimage, T168613 [12:23:17] ACKNOWLEDGEMENT - Disk space on mw1228 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff needs reimage, T168613 [12:23:18] ACKNOWLEDGEMENT - HHVM processes on mw1228 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff needs reimage, T168613 [12:23:18] ACKNOWLEDGEMENT - HHVM rendering on mw1228 is CRITICAL: connect to address 10.64.48.63 and port 80: Connection refused Muehlenhoff needs reimage, T168613 [12:23:19] ACKNOWLEDGEMENT - IPMI Temperature on mw1228 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff needs reimage, T168613 [12:23:19] ACKNOWLEDGEMENT - Nginx local proxy to apache on mw1228 is CRITICAL: connect to address 10.64.48.63 and port 443: Connection refused Muehlenhoff needs reimage, T168613 [12:23:20] ACKNOWLEDGEMENT - configured eth on mw1228 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff needs reimage, T168613 [12:23:20] ACKNOWLEDGEMENT - dhclient process on mw1228 is CRITICAL: Return code of 255 is out of bounds Muehlenhoff needs reimage, T168613 [12:23:21] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1228 is CRITICAL: Host mw1228 is not in mediawiki-installation dsh group Muehlenhoff needs reimage, T168613 [12:23:39] 
10Operations, 10Traffic, 10Wikimedia-Site-requests: Optimize Wikipedia PNG Logo - https://phabricator.wikimedia.org/T170546#3434969 (10Rap2h) [12:25:16] ACKNOWLEDGEMENT - Check systemd state on mw1260 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Muehlenhoff jessie scaler migration [12:25:18] ACKNOWLEDGEMENT - puppet last run on mw1260 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 10 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[jobchron],Service[jobrunner] Muehlenhoff jessie scaler migration [12:27:50] 10Operations: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3434983 (10MoritzMuehlenhoff) [12:38:35] (03CR) 10BBlack: [C: 031] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/364952 (owner: 10Ema) [12:41:22] (03CR) 10BBlack: [C: 031] VCL: zero cleanups [puppet] - 10https://gerrit.wikimedia.org/r/363322 (owner: 10Ema) [12:41:53] 10Operations, 10Traffic, 10Wikimedia-Site-requests: Optimize Wikipedia PNG Logo - https://phabricator.wikimedia.org/T170546#3434956 (10Urbanecm) Hi, all logos should already be optimized by OptiPNG (the command recommended for optimization is `optipng -o7 `). It is up to discussion if ot...
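Urbanecm's comment on T170546 above recommends `optipng -o7` as the reproducible way to optimize the logos. A hedged sketch of batching that invocation (the wrapper function and file names below are hypothetical, not a Wikimedia tool); it only builds the command line, so nothing here requires optipng to actually be installed:

```python
import shlex

def optipng_cmd(path, level=7):
    """Build the `optipng -oN` argument list for one PNG file.

    This is a dry run: it constructs the command only. Executing it,
    e.g. via subprocess.run(), requires optipng on the host.
    """
    return ["optipng", f"-o{level}", "--", path]

# Hypothetical file names, for illustration only.
for logo in ["Wikiquote-logo-sr.png", "Wikipedia-logo-v2.png"]:
    print(shlex.join(optipng_cmd(logo)))
```

`-o7` is optipng's slowest and most thorough optimization level, which is fine for a one-off batch over static logos; `shlex.join` (Python 3.8+) is used only to render the list for display.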
[12:42:20] 10Operations, 10LDAP-Access-Requests: Remove "lucie" from the wmde LDAP group - https://phabricator.wikimedia.org/T170551#3435037 (10Addshore) [12:42:25] 10Operations, 10Traffic, 10Wikimedia-Site-requests: Optimize Wikipedia PNG Logo - https://phabricator.wikimedia.org/T170546#3435052 (10Urbanecm) p:05Triage>03Lowest [12:42:44] 10Operations, 10LDAP-Access-Requests: Remove "lucie" from the wmde LDAP group - https://phabricator.wikimedia.org/T170551#3435055 (10Addshore) [12:43:20] 10Operations, 10LDAP-Access-Requests: Add "chrisneuroth" to wmde LDAP group - https://phabricator.wikimedia.org/T170552#3435056 (10Addshore) [12:43:46] jouncebot: next [12:43:46] In 0 hour(s) and 16 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170713T1300) [12:47:39] (03CR) 10BBlack: [C: 04-1] "Yes, what ema said, but otherwise I think this is a great improvement over the existing situation. (ditto for the followup for browsersec" [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [12:47:49] 10Operations, 10LDAP-Access-Requests: Remove "lucie" from the wmde LDAP group - https://phabricator.wikimedia.org/T170551#3435086 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff I've removed her from the cn=wmde LDAP group (removals don't need manager approval, especially if confirmed by ht... [12:49:40] 10Operations, 10LDAP-Access-Requests: Remove "lucie" from the wmde LDAP group - https://phabricator.wikimedia.org/T170551#3435090 (10Addshore) >>! In T170551#3435086, @MoritzMuehlenhoff wrote: > I've removed her from the cn=wmde LDAP group (removals don't need manager approval, especially if confirmed by https... 
[12:52:42] !log installing nginx security updates on cp1* [12:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:49] (03CR) 10BBlack: [C: 031] Decom RCStream [puppet] - 10https://gerrit.wikimedia.org/r/364219 (https://phabricator.wikimedia.org/T170157) (owner: 10Ottomata) [12:54:51] 10Operations, 10LDAP-Access-Requests: Add "chrisneuroth" to wmde LDAP group - https://phabricator.wikimedia.org/T170552#3435104 (10Addshore) 05Open>03stalled Marking as stalled until there is an NDA [12:55:47] (03CR) 10BBlack: [C: 031] Decom RCStream (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/364219 (https://phabricator.wikimedia.org/T170157) (owner: 10Ottomata) [12:56:25] (03CR) 10BBlack: "Just switch the long string to a templatefile as in the other one? It's too bad about heredocs :(" [puppet] - 10https://gerrit.wikimedia.org/r/355338 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [13:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170713T1300). Please do the needful. [13:00:05] Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:12] Present [13:00:17] (03PS1) 10Giuseppe Lavagetto: puppet-compiler: fix nginx virtualhost [puppet] - 10https://gerrit.wikimedia.org/r/364996 [13:00:19] o/ [13:00:35] I can SWAT today! [13:00:47] Urbanecm: the usual, can you test it at mwdebug? [13:01:17] or is it the wiki you do not have enough mojo to test? :) [13:01:23] zeljkof: I don't have rights for testing this. 
[13:01:31] Please just deploy to production if possible :) [13:01:32] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet-compiler: fix nginx virtualhost [puppet] - 10https://gerrit.wikimedia.org/r/364996 (owner: 10Giuseppe Lavagetto) [13:01:45] Urbanecm: let me see who helped the last time... :) [13:01:52] MarcoAurelio [13:02:02] in the meantime, reviewing, merging and pushing to mwdebug... [13:02:09] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:02:21] zeljkof: It was MarcoAurelio but he's not here currently [13:02:33] <_joe_> https://cdn.meme.am/cache/instances/folder46/65289046.jpg [13:02:34] just noticed that [13:02:36] <_joe_> sigh [13:03:28] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [13:03:42] can anybody help test a small config change at specieswiki? [13:03:45] 10Operations, 10Traffic, 10Wikimedia-Site-requests: Optimize Wikipedia PNG Logo - https://phabricator.wikimedia.org/T170546#3435139 (10Rap2h) Ok. I'm not sure to understand what you suggest me to do. Can I open a pull request on this Github repo after optimizing all images? Or should I follow this tutorial:... 
[13:03:52] both Urbanecm and me do not have enough mojo there [13:04:40] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364983 (https://phabricator.wikimedia.org/T170094) (owner: 10Urbanecm) [13:06:07] (03Merged) 10jenkins-bot: Add enwiki as import source for specieswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364983 (https://phabricator.wikimedia.org/T170094) (owner: 10Urbanecm) [13:06:17] (03CR) 10jenkins-bot: Add enwiki as import source for specieswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364983 (https://phabricator.wikimedia.org/T170094) (owner: 10Urbanecm) [13:08:22] 10Operations, 10LDAP-Access-Requests: Add "chrisneuroth" to wmde LDAP group - https://phabricator.wikimedia.org/T170552#3435149 (10Tobi_WMDE_SW) Confirming that @christophneuroth is working as a contractor for WMDE and Wikidata. [13:09:22] 10Operations, 10Wikimedia-Site-requests, 10Patch-For-Review, 10User-MarcoAurelio, 10User-Urbanecm: Logo for sr.wikiquote.org - https://phabricator.wikimedia.org/T168444#3435164 (10Obsuser) Different alignment of non-text logo part as well as color difference can still be seen by comparing [[https://sr.wi... [13:09:27] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:364983|Add enwiki as import source for specieswiki (T170094)]] (duration: 00m 47s) [13:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:38] T170094: Configuring importers on Wikispecies. 
- https://phabricator.wikimedia.org/T170094 [13:09:46] (03PS1) 10Addshore: Fix api_log_dir for statistics wmde [puppet] - 10https://gerrit.wikimedia.org/r/365001 (https://phabricator.wikimedia.org/T170472) [13:10:03] (03PS2) 10Addshore: Fix api_log_dir for statistics wmde [puppet] - 10https://gerrit.wikimedia.org/r/365001 (https://phabricator.wikimedia.org/T170472) [13:10:28] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:10:46] Urbanecm: deployed, but since both of us can not test it, nothing else to do I guess [13:10:58] logs seem normal... [13:11:18] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:11:34] no more commits, so... [13:11:42] !log EU SWAT finished [13:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:22] 10Operations, 10Traffic, 10Wikimedia-Site-requests: Optimize Wikipedia PNG Logo - https://phabricator.wikimedia.org/T170546#3434956 (10BBlack) I think the main point here is we'd rather have a reproducible method for optimizing these images which works on our Linux and open-source based infrastructure. Havi... 
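The `HTTP 5xx reqs/min` alerts and recoveries above report what fraction of recent graphite datapoints exceeded a value threshold (e.g. "CRITICAL: 11.11% of data above the critical threshold [1000.0]" versus "OK: Less than 1.00% above the threshold [250.0]"). A simplified Python sketch of that evaluation logic; this is an illustration, not the actual check_graphite code, and the 5%/10% warn/crit percentages below are assumed, not taken from the real check configuration:

```python
def percent_above(datapoints, threshold):
    """Percentage of non-null datapoints strictly above the threshold."""
    valid = [v for v in datapoints if v is not None]
    if not valid:
        return 0.0
    return 100.0 * sum(1 for v in valid if v > threshold) / len(valid)

def check(datapoints, warn_pct, crit_pct, threshold):
    """Map the percentage to a nagios-style state string."""
    pct = percent_above(datapoints, threshold)
    if pct >= crit_pct:
        return "CRITICAL"
    if pct >= warn_pct:
        return "WARNING"
    return "OK"

# 1 of 9 valid points above 1000 -> 11.11%, like the alert above.
# Nulls (gaps in the graphite series) are excluded from the count.
series = [200, 150, None, 1200, 300, 250, 180, 220, 90, 310]
print(percent_above(series, 1000.0))
print(check(series, warn_pct=5.0, crit_pct=10.0, threshold=1000.0))
```

Ignoring null datapoints matters during incidents like this one: when the metrics pipeline itself hiccups (graphite2* reboots earlier in the log), gaps in the series should not be counted as either above or below the threshold.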
[13:12:24] Urbanecm: thanks for deploying with #releng :) [13:12:39] (03PS3) 10Addshore: Fix api_log_dir for statistics wmde [puppet] - 10https://gerrit.wikimedia.org/r/365001 (https://phabricator.wikimedia.org/T170472) [13:12:43] (03PS3) 10Ema: VCL: zero cleanups [puppet] - 10https://gerrit.wikimedia.org/r/363322 [13:12:48] (03CR) 10Ema: [V: 032 C: 032] VCL: zero cleanups [puppet] - 10https://gerrit.wikimedia.org/r/363322 (owner: 10Ema) [13:14:24] 10Operations, 10Traffic, 10Wikimedia-Site-requests: Optimize Wikipedia PNG Logo - https://phabricator.wikimedia.org/T170546#3435191 (10Framawiki) [13:16:00] (03PS2) 10Ema: cache::misc: enable nginx-lua-prometheus [puppet] - 10https://gerrit.wikimedia.org/r/364952 [13:16:07] (03CR) 10Ema: [V: 032 C: 032] cache::misc: enable nginx-lua-prometheus [puppet] - 10https://gerrit.wikimedia.org/r/364952 (owner: 10Ema) [13:17:04] zeljkof: Thank you for deploy. [13:18:32] 10Operations, 10Traffic, 10Wikimedia-Site-requests: Optimize Wikipedia PNG Logo - https://phabricator.wikimedia.org/T170546#3435217 (10Urbanecm) >>! In T170546#3435139, @Rap2h wrote: > Ok. I'm not sure to understand what you suggest me to do. Can I open a pull request on this Github repo after optimizing all... [13:18:37] (03PS1) 10Elukey: role::mariadb::misc::eventlogging: remove the readonly constraint for slaves [puppet] - 10https://gerrit.wikimedia.org/r/365007 (https://phabricator.wikimedia.org/T156933) [13:19:06] marostegui: --^ [13:19:17] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Icinga loses downtime entries, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3435220 (10Volans) p:05Unbreak!>03High So after a lot of digging and live debugging between @akosiaris and me, we *think* to have an explanati... 
[13:21:00] akosiaris: ^^^ feel free to add anything I might have left [13:21:46] elukey: checking [13:21:48] (03PS1) 10Ema: cache::misc: enable nginx lua support [puppet] - 10https://gerrit.wikimedia.org/r/365009 [13:22:08] I am running pcc [13:22:24] 10Operations, 10Traffic, 10Wikimedia-Site-requests: Optimize Wikipedia PNG Logo - https://phabricator.wikimedia.org/T170546#3435228 (10Framawiki) (Priority levels corresponds to the urgency of the task and the speed required, particularly in #wikimedia-site-requests) [13:22:35] (03PS5) 10Elukey: Clone wikistats v2 repository and link it to v2 [puppet] - 10https://gerrit.wikimedia.org/r/362118 (https://phabricator.wikimedia.org/T167684) (owner: 10Milimetric) [13:23:39] marostegui: https://puppet-compiler.wmflabs.org/compiler02/7051/ [13:23:45] but dbstore is missing, so probably another role [13:23:56] dbstore1002 is already set to OFF [13:23:58] I was checking why [13:24:03] ahhh [13:24:14] (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler02/7050/ looks good but only db1047 gets changed" [puppet] - 10https://gerrit.wikimedia.org/r/365007 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey) [13:24:20] haha I ran the compiler before you :p [13:25:01] :D :D this is why I was waiting! :P [13:25:14] (03CR) 10Elukey: [C: 032] Clone wikistats v2 repository and link it to v2 [puppet] - 10https://gerrit.wikimedia.org/r/362118 (https://phabricator.wikimedia.org/T167684) (owner: 10Milimetric) [13:25:20] elukey: looks like the dbstore role already has read_only=off [13:25:57] marostegui: default is readonly=false right? might make sense [13:26:12] (03CR) 10Marostegui: [C: 031] "dbstore uses dbstore.my.cnf.erb which is already set to OFF; so no need to change it." 
[puppet] - 10https://gerrit.wikimedia.org/r/365007 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey) [13:27:04] \o/ [13:27:44] 10Operations, 10Wikimedia-Site-requests, 10Patch-For-Review, 10User-MarcoAurelio, 10User-Urbanecm: Logo for sr.wikiquote.org - https://phabricator.wikimedia.org/T168444#3435308 (10Urbanecm) >>! In T168444#3435164, @Obsuser wrote: > Different alignment of non-text logo part as well as color difference can... [13:28:47] (03CR) 10Ema: [C: 032] cache::misc: enable nginx lua support [puppet] - 10https://gerrit.wikimedia.org/r/365009 (owner: 10Ema) [13:28:54] (03PS2) 10Ema: cache::misc: enable nginx lua support [puppet] - 10https://gerrit.wikimedia.org/r/365009 [13:28:57] (03CR) 10Ema: [V: 032 C: 032] cache::misc: enable nginx lua support [puppet] - 10https://gerrit.wikimedia.org/r/365009 (owner: 10Ema) [13:29:08] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Icinga loses downtime entries, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3225440 (10Marostegui) Nice troubleshooting @Volans and @akosiaris! Let's hope this is the root cause and we can move on :-) [13:30:04] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Icinga loses downtime entries, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3435352 (10jcrespo) That is a very satisfying explanation, and it would fit 3 things I suspected: * That it happened on both servers (so it was n... 
[13:30:41] (03CR) 10Andrew Bogott: [C: 04-2] Enable base::firewall for labtestpuppetmaster2001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/364946 (owner: 10Muehlenhoff) [13:31:08] (03PS2) 10Elukey: role::mariadb::misc::eventlogging: remove the readonly constraint for slaves [puppet] - 10https://gerrit.wikimedia.org/r/365007 (https://phabricator.wikimedia.org/T156933) [13:34:54] elukey: let me know when you want me to change that setting on db1047 [13:34:58] RECOVERY - Check systemd state on labtestpuppetmaster2001 is OK: OK - running: The system is fully operational [13:35:29] (03CR) 10Andrew Bogott: "Ferm was definitely not running this morning, but I restarted it and now the firewall looks right. Thanks!" [dns] - 10https://gerrit.wikimedia.org/r/364798 (owner: 10Andrew Bogott) [13:35:36] (03CR) 10Elukey: [C: 032] role::mariadb::misc::eventlogging: remove the readonly constraint for slaves [puppet] - 10https://gerrit.wikimedia.org/r/365007 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey) [13:35:48] RECOVERY - Check whether ferm is active by checking the default input chain on labtestpuppetmaster2001 is OK: OK ferm input default policy is set [13:36:02] marostegui: merging, you can do it whenever you wish :) [13:36:23] ok changed [13:36:59] 10Operations, 10Wikimedia-Site-requests, 10Patch-For-Review, 10User-MarcoAurelio, 10User-Urbanecm: Logo for sr.wikiquote.org - https://phabricator.wikimedia.org/T168444#3435373 (10Obsuser) >>! In T168444#3435308, @Urbanecm wrote: > When comparing https://upload.wikimedia.org/wikipedia/commons/thumb/4/4f/... [13:37:42] marostegui: thanks! [13:39:29] (03Abandoned) 10Muehlenhoff: Enable base::firewall for labtestpuppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/364946 (owner: 10Muehlenhoff) [13:40:15] 10Operations, 10Traffic, 10Wikimedia-Site-requests: Optimize Wikipedia PNG Logo - https://phabricator.wikimedia.org/T170546#3435376 (10Rap2h) Ok thanks to all! 
(I'm still discovering the contribution rules, so thanks to help and mentor me!) > can you document a process we could use to apply this universall... [13:41:11] ACKNOWLEDGEMENT - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 26 connecting: cp3009_v4, cp3009_v6 Giuseppe Lavagetto T148422 [13:41:11] ACKNOWLEDGEMENT - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 Giuseppe Lavagetto T148422 [13:41:11] ACKNOWLEDGEMENT - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 Giuseppe Lavagetto T148422 [13:41:11] ACKNOWLEDGEMENT - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 Giuseppe Lavagetto T148422 [13:41:11] ACKNOWLEDGEMENT - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 Giuseppe Lavagetto T148422 [13:41:12] ACKNOWLEDGEMENT - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 Giuseppe Lavagetto T148422 [13:41:12] ACKNOWLEDGEMENT - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 Giuseppe Lavagetto T148422 [13:43:39] ACKNOWLEDGEMENT - kartotherian endpoints health on maps-test2001 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (exp [13:43:39] }/{z}/{x}/{y}@{scale}x.{format} (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for 
osm-pbf) is CRITICAL: Test tile service info for osm-pbf returned the unexpected status 400 (e [13:43:39] eppe Lavagetto T169011 [13:43:40] ACKNOWLEDGEMENT - kartotherian endpoints health on maps-test2002 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (exp [13:43:41] }/{z}/{x}/{y}@{scale}x.{format} (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-pbf) is CRITICAL: Test tile service info for osm-pbf returned the unexpected status 400 (e [13:43:41] eppe Lavagetto T169011 [13:43:43] ACKNOWLEDGEMENT - kartotherian endpoints health on maps-test2003 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (exp [13:43:44] }/{z}/{x}/{y}@{scale}x.{format} (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-pbf) is CRITICAL: Test tile service info for osm-pbf returned the unexpected status 400 (e [13:43:44] eppe Lavagetto T169011 [13:43:46] ACKNOWLEDGEMENT - 
kartotherian endpoints health on maps-test2004 is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) is CRITICAL: Test get a tile in the middle of the ocean, with overzoom returned the unexpected status 400 (expecting: 200): /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) is CRITICAL: Test Small scaled map returned the unexpected status 400 (exp [13:43:47] }/{z}/{x}/{y}@{scale}x.{format} (default scaled tile) is CRITICAL: Test default scaled tile returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200): /{src}/info.json (tile service info for osm-pbf) is CRITICAL: Test tile service info for osm-pbf returned the unexpected status 400 (e [13:43:47] eppe Lavagetto T169011 [13:45:23] ahahaha [13:45:26] so my code works [13:45:36] _joe_: eppe Lavagetto ... [13:45:56] <_joe_> "wonderfully" [13:53:22] (03PS1) 10Elukey: statistics::sites::stat: fix the symlink for v2 [puppet] - 10https://gerrit.wikimedia.org/r/365013 (https://phabricator.wikimedia.org/T167684) [13:54:18] RECOVERY - Keyholder SSH agent on netmon2001 is OK: OK: Keyholder is armed with all configured keys. 
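The acknowledged kartotherian checks above probe URI templates such as `/{src}/{z}/{x}/{y}.{format}` and `/{src}/{z}/{x}/{y}@{scale}x.{format}`. A minimal sketch of how such a template expands into a concrete tile path (the field values filled in are hypothetical examples, not taken from the checks):

```python
# Minimal sketch: expanding kartotherian-style URI templates into concrete
# tile paths. The template strings mirror the health checks above; the
# values filled in are hypothetical examples.
TILE_TEMPLATE = "/{src}/{z}/{x}/{y}.{format}"
SCALED_TEMPLATE = "/{src}/{z}/{x}/{y}@{scale}x.{format}"

def expand(template: str, **fields) -> str:
    # str.format handles the {name} placeholders used by the check definitions
    return template.format(**fields)

print(expand(TILE_TEMPLATE, src="osm-intl", z=0, x=0, y=0, format="png"))
# -> /osm-intl/0/0/0.png
print(expand(SCALED_TEMPLATE, src="osm-intl", z=3, x=4, y=2, scale=2, format="png"))
# -> /osm-intl/3/4/2@2x.png
```

A 400 from every one of these endpoints at once, as in the alerts, points at the service rejecting requests globally rather than a single bad template.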
[13:54:44] (03CR) 10Elukey: [C: 032] statistics::sites::stat: fix the symlink for v2 [puppet] - 10https://gerrit.wikimedia.org/r/365013 (https://phabricator.wikimedia.org/T167684) (owner: 10Elukey) [13:55:28] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:55:38] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:57:38] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:58:19] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:59:17] 10Operations, 10Traffic: Support TLSv1.3 - https://phabricator.wikimedia.org/T170567#3435393 (10BBlack) [13:59:58] should already be ok --^ [14:01:28] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:02:27] (03PS1) 10BBlack: ciphersuites: add TLSv1.3 variants in "high" list [puppet] - 10https://gerrit.wikimedia.org/r/365014 (https://phabricator.wikimedia.org/T170567) [14:02:38] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:02:38] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:03:19] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:03:51] it is weird that I don't see any issue in varnishkafka producers https://grafana.wikimedia.org/dashboard/db/varnishkafka?orgId=1&from=now-3h&to=now [14:04:27] (03PS4) 10Ottomata: Fix api_log_dir for statistics wmde [puppet] - 10https://gerrit.wikimedia.org/r/365001 (https://phabricator.wikimedia.org/T170472) (owner: 10Addshore) [14:04:33] (03CR) 10Ottomata: [V: 032 C: 032] Fix api_log_dir for statistics wmde [puppet] - 
10https://gerrit.wikimedia.org/r/365001 (https://phabricator.wikimedia.org/T170472) (owner: 10Addshore) [14:04:58] RECOVERY - MariaDB Slave SQL: s3 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [14:05:47] 10Operations, 10Traffic, 10netops: Recurring varnish-be fetch failures in codfw - https://phabricator.wikimedia.org/T170131#3435430 (10elukey) A spike just happened and one interesting thing that I noticed is that the varnishkafka producers seems not affected: https://grafana.wikimedia.org/dashboard/db/varn... [14:07:39] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [14:07:51] 10Operations, 10Traffic, 10Wikimedia-Site-requests: Optimize Wikipedia PNG Logo - https://phabricator.wikimedia.org/T170546#3435445 (10Urbanecm) Well, for developing a process you aren't required to know what the tool does in the background (although it's better to know than don't know). When I decide to upd... [14:08:28] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [14:11:24] Hello, why https://mai.wikimedia.org/wiki/%E0%A4%B8%E0%A4%AE%E0%A5%8D%E0%A4%AE%E0%A5%81%E0%A4%96_%E0%A4%AA%E0%A4%A8%E0%A5%8D%E0%A4%A8%E0%A4%BE throws an error? The wiki was created by Dereckson yesterday and apache conf was merged today so it should work but it doesn't :( [14:12:10] <_joe_> Urbanecm: this seems like a db problem [14:12:15] <_joe_> as in a config problem [14:12:19] <_joe_> it has nothing to do with apache [14:12:49] Urbanecm: check this: https://phabricator.wikimedia.org/T168788#3435384 [14:13:10] the database is well created on the master and everything [14:13:38] !log rebooting graphite1* for kernel update [14:13:46] marostegui: I don't think this is related, there should be no account. But even wikis with no account should be readable... 
[14:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:10] _joe_: I know this is a DB problem and it has nothing to do with apache. I'm wondering how to fix it [14:14:21] Urbanecm: ok :) [14:14:47] <_joe_> marostegui: yeah, my best bet is that something is wrong in mediawiki-config [14:15:11] yeah, most likely, I just double checked the DB itself just in case :) [14:15:32] Urbanecm: Reedy is a good one to ask about looking at these backtraces [14:15:49] Heh [14:15:53] Hello Reedy :) [14:15:55] It'll be something simple [14:16:05] Anyone want to place bets? [14:16:32] (03PS1) 10Urbanecm: Run optipng -o7 at all PNGs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365023 (https://phabricator.wikimedia.org/T170569) [14:17:09] Yup [14:17:09] Query: SELECT rt_revision FROM `revtag` WHERE rt_page = '1' AND rt_type = 'tp:mark' ORDER BY rt_revision DESC LIMIT 1 [14:17:09] Function: TranslatablePage::getTag [14:17:09] Error: 1146 Table 'maiwikimedia.revtag' doesn't exist (10.64.16.191) [14:17:15] Translate enabled, but tables not created? [14:17:19] Easy to fix [14:17:36] Reedy: Can you fix it?
[14:17:40] Yeah [14:17:47] Just finding T168782 so I can !log it [14:17:48] T168782: Create fishbowl wiki for Maithili Wikimedians User Group - https://phabricator.wikimedia.org/T168782 [14:18:08] PROBLEM - High lag on wdqs2003 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0] [14:18:23] 10Operations, 10ops-codfw: mw2201, mw2202 - contact Dell and replace main board - https://phabricator.wikimedia.org/T170307#3435510 (10faidon) [14:18:38] PROBLEM - High lag on wdqs2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0] [14:18:48] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:19:01] !log `mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=maiwikimedia Translate` for T168782 [14:19:04] Urbanecm: fixed [14:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:18] (03PS2) 10BBlack: ssl_ciphersuite: limit ECDH curves where possible [puppet] - 10https://gerrit.wikimedia.org/r/361879 [14:20:01] So it was a db problem to the extent that the db didn't have the right tables [14:20:05] But that wasn't the db's fault ;) [14:20:53] Reedy: Thank you for fixing!
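The backtrace Reedy pasted pins the failure to MySQL error 1146, and the missing database and table can be pulled straight out of that message. A minimal sketch of that extraction (the regex and function are illustrative, not MediaWiki code):

```python
import re

# Minimal sketch: recover the missing database/table from a MySQL error 1146
# message like the one pasted above. Illustrative only, not MediaWiki code.
ERR_1146 = re.compile(r"Table '(?P<db>[^.']+)\.(?P<table>[^']+)' doesn't exist")

def missing_table(message: str):
    m = ERR_1146.search(message)
    return (m.group("db"), m.group("table")) if m else None

print(missing_table("Error: 1146 Table 'maiwikimedia.revtag' doesn't exist (10.64.16.191)"))
# -> ('maiwikimedia', 'revtag'): Translate's revtag table was never created
```

Here the table name (`revtag`) identifies the extension whose tables were skipped at wiki creation, which is exactly why `createExtensionTables.php ... Translate` was the fix.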
[14:21:06] np [14:21:44] !log Run redact_sanitarium on db1069 and db1095 for maiwikimedia - T168788 [14:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:54] T168788: Prepare and check storage layer for maiwikimedia - https://phabricator.wikimedia.org/T168788 [14:23:38] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:23:54] 10Operations, 10Wikimedia-Site-requests, 10Patch-For-Review, 10User-Urbanecm: Create fishbowl wiki for Maithili Wikimedians User Group - https://phabricator.wikimedia.org/T168782#3435563 (10Urbanecm) [14:31:21] (03CR) 10Alexandros Kosiaris: [C: 04-1] "A few minor comments, rest LGTM" (033 comments) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363217 (owner: 10Giuseppe Lavagetto) [14:33:23] 10Operations, 10Performance-Team, 10TemplateStyles, 10Traffic, and 3 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3435667 (10Fjalapeno) [14:34:02] 10Operations, 10Performance-Team, 10TemplateStyles, 10Traffic, and 3 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#2231032 (10Fjalapeno) @tgr is this still valid? 
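The recurring `HTTP 5xx reqs/min` alerts above report figures like `11.11% of data above the critical threshold [1000.0]`: the check looks at recent graphite datapoints and computes the share above a limit (11.11% is 1 point of 9). A minimal sketch of that calculation, with made-up sample data:

```python
# Minimal sketch of the graphite threshold check behind the 5xx alerts above:
# the reported percentage is the share of non-null datapoints over the limit.
def percent_above(datapoints, limit):
    points = [p for p in datapoints if p is not None]  # graphite series may contain nulls
    if not points:
        return 0.0
    return 100.0 * sum(1 for p in points if p > limit) / len(points)

# Made-up series: 9 non-null points, 1 above 1000 -> 11.11%, as in the alerts.
series = [200, 300, None, 250, 400, 1200, 150, 220, 310, 280]
print(round(percent_above(series, 1000.0), 2))  # -> 11.11
```

With only nine datapoints in the window, a single spike crosses the alert line, which is consistent with how quickly these checks flip between PROBLEM and RECOVERY in the log.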
[14:36:48] PROBLEM - High lag on wdqs2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0] [14:38:09] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:38:58] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [14:38:59] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1661 bytes in 0.010 second response time [14:39:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [14:42:33] !log roll-restart cassandra in services-test to pick up renewed certs [14:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:01] PROBLEM - cassandra-b SSL 10.192.16.159:7001 on restbase-test2003 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [14:43:51] PROBLEM - cassandra-a SSL 10.64.16.153:7001 on cerium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [14:44:01] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:44:01] RECOVERY - cassandra-b SSL 10.192.16.159:7001 on restbase-test2003 is OK: SSL OK - Certificate restbase-test2003-b valid until 2018-07-13 14:23:47 +0000 (expires in 364 days) [14:44:42] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [14:44:51] RECOVERY - cassandra-a SSL 10.64.16.153:7001 on cerium is OK: SSL OK - Certificate cerium-a valid until 2018-07-13 14:23:32 +0000 (expires in 364 days) [14:45:52] PROBLEM - cassandra-a SSL 10.64.16.188:7001 on praseodymium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [14:46:22] PROBLEM - MariaDB Slave SQL: s3 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1007, Errmsg: Error 
Can't create database maiwikimedia: database exists on query. Default database: maiwikimedia. [Query snipped] [14:46:49] ^ I will fix that [14:46:52] RECOVERY - cassandra-a SSL 10.64.16.188:7001 on praseodymium is OK: SSL OK - Certificate praseodymium-a valid until 2018-07-13 14:23:35 +0000 (expires in 364 days) [14:46:53] it was expected [14:47:14] I was doing that [14:47:24] whoever does it - please let's log it [14:47:27] in advance [14:47:29] jynus: I will do it :) [14:47:46] I was writing "root@dbstore2001[(none)]> SET GLOBAL" [14:47:51] haha [14:47:54] look what I have: [14:48:00] (reverse-i-search)`ski': set global sql_slave_skip_counter = 1; [14:48:02] XDDDD [14:48:38] so let's agree on !log first [14:48:50] then do it, to avoid creating an issue where there wasn't one [14:48:50] I will: !log Skip maiwikimedia database creation which is breaking dbstore2001 replication - T168788 [14:48:51] T168788: Prepare and check storage layer for maiwikimedia - https://phabricator.wikimedia.org/T168788 [14:48:57] does that sound good? [14:49:11] I agree, though, it is a good method [14:49:26] !log Skip maiwikimedia database creation which is breaking dbstore2001 replication - T168788 [14:49:27] wait, was that just logged? [14:49:34] ah, no [14:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:42] just the ticket reference [14:50:14] yeah [14:50:18] So, I will run it [14:50:20] ok?
[14:51:17] yes [14:51:21] (03PS1) 10Mforns: Remove 2 schemas from EventLogging purging white-list [puppet] - 10https://gerrit.wikimedia.org/r/365027 (https://phabricator.wikimedia.org/T156933) [14:51:22] ok [14:51:38] done [14:52:31] RECOVERY - MariaDB Slave SQL: s3 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [14:55:08] (03Abandoned) 10Mforns: Remove 2 schemas from EventLogging purging white-list [puppet] - 10https://gerrit.wikimedia.org/r/365027 (https://phabricator.wikimedia.org/T156933) (owner: 10Mforns) [14:56:42] (03PS1) 10Ema: WIP: base::kernel: add base::kernel::module [puppet] - 10https://gerrit.wikimedia.org/r/365030 [14:58:41] (03CR) 10jerkins-bot: [V: 04-1] WIP: base::kernel: add base::kernel::module [puppet] - 10https://gerrit.wikimedia.org/r/365030 (owner: 10Ema) [15:01:04] 10Operations, 10Performance-Team, 10TemplateStyles, 10Traffic, and 3 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3435845 (10Tgr) It's blocked on the two open subtasks, T164790 and T164791 (arguably the latter is less important and could be done post-deploy as... 
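The dbstore2001 breakage and its fix follow a standard pattern: replication stops on error 1007 (a `CREATE DATABASE` for a database the replica already has), the event is judged safe to skip, and `SET GLOBAL sql_slave_skip_counter = 1;` is run, exactly as the reverse-i-search above shows. A minimal sketch of the "is this skippable?" judgment (the set of codes below is an illustrative policy, not an official MariaDB list):

```python
# Minimal sketch: deciding whether a broken replication event is safe to skip
# with `SET GLOBAL sql_slave_skip_counter = 1;`. The "object already exists"
# codes below are an illustrative policy, not an official list.
SKIPPABLE_ERRNOS = {
    1007,  # ER_DB_CREATE_EXISTS: CREATE DATABASE, but it already exists
    1050,  # ER_TABLE_EXISTS_ERROR: CREATE TABLE, but it already exists
    1062,  # ER_DUP_ENTRY: INSERT of a row that is already present
}

def safe_to_skip(errno: int) -> bool:
    return errno in SKIPPABLE_ERRNOS

print(safe_to_skip(1007))  # the dbstore2001 case above -> True
print(safe_to_skip(1146))  # a missing table is NOT safe to skip -> False
```

The !log-before-acting agreement in the exchange above matters because skipping events silently can mask real data divergence; here the skip is safe precisely because the database already existed on the replica.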
[15:02:21] PROBLEM - HHVM rendering on mw1280 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [15:03:22] RECOVERY - HHVM rendering on mw1280 is OK: HTTP OK: HTTP/1.1 200 OK - 75025 bytes in 0.702 second response time [15:04:42] PROBLEM - puppetmaster backend https on labtestpuppetmaster2001 is CRITICAL: connect to address 208.80.153.108 and port 8141: Connection refused [15:05:31] PROBLEM - puppetmaster https on labtestpuppetmaster2001 is CRITICAL: connect to address 208.80.153.108 and port 8140: Connection refused [15:07:10] (03PS1) 10Jcrespo: [WIP]mariadb: Add grants for rddmark to m1 [puppet] - 10https://gerrit.wikimedia.org/r/365035 (https://phabricator.wikimedia.org/T170158) [15:07:54] (03PS1) 10Marostegui: s6.hosts: Add labsdb1009 and labsdb1010 [software] - 10https://gerrit.wikimedia.org/r/365036 (https://phabricator.wikimedia.org/T153743) [15:16:56] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [15:18:16] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [15:20:05] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2019 - https://phabricator.wikimedia.org/T170503#3435969 (10Papaul) a:03Marostegui Disk replacement complete [15:21:30] (03PS1) 10Alexandros Kosiaris: puppetmaster: profilize puppetdb hiera lookups [puppet] - 10https://gerrit.wikimedia.org/r/365041 [15:23:37] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2019 - https://phabricator.wikimedia.org/T170503#3433715 (10jcrespo) Thanks papaul, do you have an idea of how many 600 gb disks like these are left? I do not need an exact count, "0, 1, 2 or many" is specific enough. I think chris is short of them I wonde... 
[15:23:47] RECOVERY - puppetmaster backend https on labtestpuppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 330 bytes in 0.194 second response time [15:24:16] RECOVERY - puppetmaster https on labtestpuppetmaster2001 is OK: HTTP OK: Status line output matched 400 - 331 bytes in 0.186 second response time [15:24:16] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:24:56] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:25:16] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2019 - https://phabricator.wikimedia.org/T170503#3435983 (10Marostegui) Thanks @Papaul, looks like that after our doubts to identify the correct disk, we changed the correct one :-) ``` Code: 0x0000006a Class: 0 Locale: 0x02 Event Description: Rebuild auto... [15:25:44] (03CR) 10Alexandros Kosiaris: [C: 032] "PCC fine at https://puppet-compiler.wmflabs.org/compiler02/7057/, merging" [puppet] - 10https://gerrit.wikimedia.org/r/365041 (owner: 10Alexandros Kosiaris) [15:25:53] 10Operations, 10ops-codfw: pc2006 crashed - https://phabricator.wikimedia.org/T170520#3435985 (10Marostegui) I have talked to Papaul about this and he'll contact Dell next week [15:32:10] 10Operations, 10Performance-Team, 10TemplateStyles, 10Traffic, and 3 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3436015 (10dr0ptp4kt) I'm in favor of deployment to wikitech for starters, where the highly technical crowd can handle the complexity. @Deskana @j... 
[15:33:49] !log Deploy alter table on s1 - labsdb1009 - T166204 [15:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:00] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204 [15:34:15] (03CR) 10Marostegui: [C: 032] s6.hosts: Add labsdb1009 and labsdb1010 [software] - 10https://gerrit.wikimedia.org/r/365036 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [15:35:34] (03Merged) 10jenkins-bot: s6.hosts: Add labsdb1009 and labsdb1010 [software] - 10https://gerrit.wikimedia.org/r/365036 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [15:38:59] jouncebot: next [15:38:59] In 0 hour(s) and 21 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170713T1600) [15:45:57] 10Operations, 10Performance-Team, 10TemplateStyles, 10Traffic, and 3 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3436066 (10Tgr) I believe templates in Flow comments/summaries would be broken in both edit and read mode. But then, complex templates are almost... [15:47:31] 10Operations, 10Ops-Access-Requests: Requesting access to recommendation-api for nschaaf - https://phabricator.wikimedia.org/T170592#3436068 (10schana) [15:49:05] 10Operations, 10Ops-Access-Requests: Requesting access to recommendation-api for nschaaf - https://phabricator.wikimedia.org/T170592#3436068 (10schana) [15:49:10] 10Operations, 10Traffic, 10netops: Recurring varnish-be fetch failures in codfw - https://phabricator.wikimedia.org/T170131#3436107 (10elukey) Re-adding the comment after checking that all the cp hosts were pushing metrics to graphite (on some of them logster was erroring for lock not acquired). From https:... 
[15:54:56] PROBLEM - MariaDB Slave Lag: s1 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 644.97 seconds [15:55:06] ^ alter table going on [15:56:22] down' it [15:56:29] it should now work well [15:56:38] yeah, just did it :) [15:56:45] 10Operations, 10ops-codfw: mw2201, mw2202 - contact Dell and replace main board - https://phabricator.wikimedia.org/T170307#3436118 (10Papaul) @Dzahn Received a call from the dispatch team saying that they received the part for only one server, and that they don't know what the ETA is for the part for the other serve... [15:58:32] marostegui: I am also running the cleaner script :P [15:58:51] And db1047 thought it was going to have a quiet evening... [15:59:56] poor db1047 [16:00:05] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170713T1600). [16:00:05] Krinkle: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:02:18] 10Operations, 10Icinga, 10monitoring: Icinga loses downtime entries, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3436131 (10faidon) [16:02:28] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2019 - https://phabricator.wikimedia.org/T170503#3436132 (10Papaul) @jcrespo I do not have spare on site, the ones I am using are pulled from decom servers.
[16:03:37] !log dzahn@neodymium conftool action : set/pooled=no; selector: name=mw2201.codfw.wmnet [16:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:46] !log dzahn@neodymium conftool action : set/pooled=no; selector: name=mw2202.codfw.wmnet [16:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:34] 10Operations, 10ops-codfw: mw2201, mw2202 - contact Dell and replace main board - https://phabricator.wikimedia.org/T170307#3436138 (10Dzahn) Thanks @papaul I have depooled both servers just in case. Let me know when we can repool them. [16:10:10] (03CR) 10Dzahn: icinga/role:mail::mx: add monitoring of exim queue size (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) (owner: 10Dzahn) [16:15:04] jouncebot: next [16:15:04] In 0 hour(s) and 44 minute(s): Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170713T1700) [16:15:31] (03PS1) 10Andrew Bogott: Don't use puppetdb on labtestpuppetmaster2001. [puppet] - 10https://gerrit.wikimedia.org/r/365052 [16:15:33] (03PS1) 10Andrew Bogott: Puppetmaster: Fix apache config ssldir [puppet] - 10https://gerrit.wikimedia.org/r/365053 [16:16:37] PROBLEM - cassandra-a CQL 10.192.16.176:9042 on restbase2007 is CRITICAL: connect to address 10.192.16.176 and port 9042: Connection refused [16:17:06] PROBLEM - cassandra-a SSL 10.192.16.176:7001 on restbase2007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:17:41] looking ^^^ [16:18:16] PROBLEM - Check systemd state on restbase2007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:18:27] PROBLEM - cassandra-a service on restbase2007 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [16:19:04] !log Starting cassandra-a, restbase2007 (OOM) [16:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:16] RECOVERY - Check systemd state on restbase2007 is OK: OK - running: The system is fully operational [16:19:26] RECOVERY - cassandra-a service on restbase2007 is OK: OK - cassandra-a is active [16:20:36] (03CR) 10Andrew Bogott: [C: 032] Don't use puppetdb on labtestpuppetmaster2001. [puppet] - 10https://gerrit.wikimedia.org/r/365052 (owner: 10Andrew Bogott) [16:21:06] RECOVERY - cassandra-a SSL 10.192.16.176:7001 on restbase2007 is OK: SSL OK - Certificate restbase2007-a valid until 2017-09-12 15:35:50 +0000 (expires in 60 days) [16:21:46] RECOVERY - cassandra-a CQL 10.192.16.176:9042 on restbase2007 is OK: TCP OK - 0.036 second response time on 10.192.16.176 port 9042 [16:25:48] 10Operations, 10Ops-Access-Requests: Requesting access to recommendation-api for nschaaf - https://phabricator.wikimedia.org/T170592#3436267 (10DarTar) Approved, thanks. 
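The `expires in N days` figures in the SSL checks above are just the delta between the certificate's notAfter timestamp and the current time; the restbase2007 recovery at 16:21 reports 60 days against a 2017-09-12 notAfter. A minimal sketch of that arithmetic:

```python
from datetime import datetime, timezone

# Minimal sketch: the "expires in N days" figure from the SSL checks is
# notAfter minus "now", floored to whole days.
def days_until_expiry(not_after: datetime, now: datetime) -> int:
    return (not_after - now).days

not_after = datetime(2017, 9, 12, 15, 35, 50, tzinfo=timezone.utc)  # restbase2007-a notAfter
now = datetime(2017, 7, 13, 16, 21, 6, tzinfo=timezone.utc)         # time of the recovery
print(days_until_expiry(not_after, now))  # -> 60, matching "expires in 60 days"
```

The same arithmetic explains the 364-day figures for the freshly renewed services-test certs earlier in the log: one-year certs checked minutes after issuance.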
[16:29:36] (03PS1) 10Andrew Bogott: Labtest: Allow the new labtestpuppetmaster to reach mysql on labtestcontrol [puppet] - 10https://gerrit.wikimedia.org/r/365056 [16:31:03] (03CR) 10jerkins-bot: [V: 04-1] Labtest: Allow the new labtestpuppetmaster to reach mysql on labtestcontrol [puppet] - 10https://gerrit.wikimedia.org/r/365056 (owner: 10Andrew Bogott) [16:31:22] 10Operations, 10Wikimedia-Site-requests, 10Patch-For-Review, 10User-MarcoAurelio, 10User-Urbanecm: Logo for sr.wikiquote.org - https://phabricator.wikimedia.org/T168444#3436293 (10MarcoAurelio) a:05MarcoAurelio>03None [16:32:29] (03PS2) 10Andrew Bogott: Labtest: Allow the new labtestpuppetmaster to reach mysql on labtestcontrol [puppet] - 10https://gerrit.wikimedia.org/r/365056 [16:33:26] RECOVERY - Check systemd state on labstore2001 is OK: OK - running: The system is fully operational [16:34:23] !log labstore2001:~# systemctl disable lvm2-activation && systemctl disable lvm2-activation-early && systemctl reset-failed (slated to be reimaged by madhu -- this alert is non-actionable) [16:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:51] 10Operations, 10Traffic: Extending our HSTS value beyond ~1y - https://phabricator.wikimedia.org/T170598#3436309 (10BBlack) [16:42:13] (03PS3) 10BBlack: ssl_ciphersuite: limit ECDH curves where possible [puppet] - 10https://gerrit.wikimedia.org/r/361879 [16:45:33] (03CR) 10BBlack: [C: 032] ssl_ciphersuite: limit ECDH curves where possible [puppet] - 10https://gerrit.wikimedia.org/r/361879 (owner: 10BBlack) [16:48:45] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2019 - https://phabricator.wikimedia.org/T170503#3436419 (10Marostegui) 05Open>03Resolved And RAID back to optimal: ``` root@db2019:~# megacli -LDInfo -lALL -aALL Adapter 0 -- Virtual Drive Information: Virtual Drive: 0 (Target Id: 0) Name... 
[16:55:16] RECOVERY - MegaRAID on db2019 is OK: OK: optimal, 1 logical, 6 physical, WriteBack policy [16:55:36] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: connect to address tools.wmflabs.org and port 80: Connection refused [16:55:56] PROBLEM - tools nginx proxy health on tools.wmflabs.org is CRITICAL: connect to address tools.wmflabs.org and port 80: Connection refused [16:55:56] PROBLEM - HTTPS-wmflabs on tools.wmflabs.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170713T1700). [17:00:36] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3570 bytes in 0.008 second response time [17:00:40] ^ temporarily hack-fixed the tools.wmflabs.org issue above, it's from my commit [17:00:51] bd808: ^ [17:00:56] RECOVERY - tools nginx proxy health on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 205 bytes in 0.002 second response time [17:00:57] RECOVERY - HTTPS-wmflabs on tools.wmflabs.org is OK: SSL OK - Certificate *.wmflabs.org valid until 2017-10-16 15:41:05 +0000 (expires in 94 days) [17:01:54] bblack: thanks. I may have done something bad around the same time too that I'm undoing [17:02:59] nginx won't start on tools-proxy-01. looking [17:03:05] bd808: the issue is that my change assumes that jessie hosts that use nginx via the tlsproxy module use libssl1.1 [17:03:18] PROBLEM - puppet last run on elastic2009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:03:24] I think that's because the nginx package hasn't been updated there in a long time [17:03:26] PROBLEM - puppet last run on mw2152 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:03:26] PROBLEM - Check systemd state on nihal is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:03:26] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:03:33] (e.g. "apt-get install nginx-common") [17:03:36] PROBLEM - puppet last run on mw2194 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:03:37] PROBLEM - puppet last run on ms-be2022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:03:47] PROBLEM - puppet last run on mw2195 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:03:56] PROBLEM - puppet last run on db2059 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:03:56] PROBLEM - puppet last run on mw2189 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:07] PROBLEM - puppet last run on elastic2030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:16] PROBLEM - puppet last run on db2092 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:16] PROBLEM - puppet last run on mw2122 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:17] PROBLEM - puppet last run on elastic2027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:17] PROBLEM - puppet last run on mw2245 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:26] PROBLEM - puppet last run on restbase2007 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:04:27] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:27] PROBLEM - puppet last run on pybal-test2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:27] PROBLEM - puppet last run on mw2172 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:36] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:36] PROBLEM - puppet last run on ms-be2035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:46] PROBLEM - puppet last run on mw2187 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:46] PROBLEM - puppet last run on cp2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:46] PROBLEM - puppet last run on db2060 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:47] PROBLEM - puppet last run on planet2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:57] PROBLEM - puppet last run on mw2227 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:04:57] PROBLEM - puppet last run on mw2237 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:05:06] PROBLEM - puppet last run on db2023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:05:07] PROBLEM - puppet last run on db2012 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:05:07] PROBLEM - puppet last run on db2067 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:05:16] PROBLEM - puppet last run on mw2244 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:05:16] PROBLEM - puppet last run on elastic2028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:05:17] PROBLEM - puppet last run on restbase2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:05:17] PROBLEM - puppet last run on mw2163 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:05:17] PROBLEM - puppet last run on mw2101 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:05:17] PROBLEM - puppet last run on mw2125 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:05:31] (03PS3) 10Andrew Bogott: Labtest: Allow the new labtestpuppetmaster to reach mysql on labtestcontrol [puppet] - 10https://gerrit.wikimedia.org/r/365056 [17:05:36] PROBLEM - puppet last run on cp2021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:05:36] PROBLEM - puppet last run on mc2028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:05:36] PROBLEM - puppet last run on mw2120 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:05:46] PROBLEM - puppet last run on mw2153 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:05:46] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:05:46] PROBLEM - puppet last run on db2029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:05:47] PROBLEM - puppet last run on elastic2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:05:56] PROBLEM - puppet last run on db2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:05:56] PROBLEM - puppet last run on elastic2033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:05:57] PROBLEM - puppet last run on db2083 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:06:00] what new hell is this puppet storm? [17:06:06] PROBLEM - puppet last run on graphite2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:06:06] PROBLEM - puppet last run on mw2203 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:06:06] PROBLEM - puppet last run on wtp2009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:06:07] PROBLEM - puppet last run on ms-be2015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:06:16] PROBLEM - puppet last run on wtp2020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:06:16] PROBLEM - puppet last run on wezen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:06:16] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:06:26] PROBLEM - puppet last run on mc2022 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:06:26] PROBLEM - puppet last run on ms-be2037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:06:27] bd808: because nginx breaks puppet too, and the puppetdb hosts use an outdated nginx just like tools.wmflabs.org [17:06:32] !log disabled puppet on nitrogen [17:06:36] PROBLEM - puppet last run on db2061 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:06:36] PROBLEM - puppet last run on mw2199 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:06:37] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:06:37] PROBLEM - puppet last run on elastic2007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:46] PROBLEM - puppet last run on mw2211 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:06:46] PROBLEM - puppet last run on mw2241 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:06:46] PROBLEM - puppet last run on wtp2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:06:46] PROBLEM - puppet last run on ms-be2031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:06:53] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3436497 (10Cmjohnson) @jcrespo and @marostegui db1106 raid has been setup. [17:06:56] PROBLEM - puppet last run on db2033 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:06:56] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:07:06] PROBLEM - puppet last run on ms-be2019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:07:07] PROBLEM - puppet last run on elastic2022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:07:07] PROBLEM - puppet last run on lvs4002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:07:17] PROBLEM - puppet last run on kubernetes2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:07:26] PROBLEM - puppet last run on mw2180 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:07:26] PROBLEM - puppet last run on db2080 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:07:26] PROBLEM - puppet last run on ganeti2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:07:26] RECOVERY - Check systemd state on nihal is OK: OK - running: The system is fully operational [17:07:27] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:07:36] PROBLEM - puppet last run on db2082 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:07:36] PROBLEM - puppet last run on es2012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:07:37] PROBLEM - puppet last run on mw2097 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:07:46] PROBLEM - puppet last run on db2039 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:07:56] PROBLEM - puppet last run on elastic2015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:07:56] PROBLEM - puppet last run on mw2251 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:07:57] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:07:57] (03PS1) 10BBlack: Revert "ssl_ciphersuite: limit ECDH curves where possible" [puppet] - 10https://gerrit.wikimedia.org/r/365062 [17:08:07] PROBLEM - puppet last run on poolcounter2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:08:10] (03CR) 10BBlack: [V: 032 C: 032] Revert "ssl_ciphersuite: limit ECDH curves where possible" [puppet] - 10https://gerrit.wikimedia.org/r/365062 (owner: 10BBlack) [17:08:16] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:08:17] PROBLEM - puppet last run on rdb2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:08:17] PROBLEM - puppet last run on restbase2011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:08:36] PROBLEM - puppet last run on rdb2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:08:36] PROBLEM - puppet last run on elastic2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:08:38] PROBLEM - puppet last run on elastic2023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:08:38] PROBLEM - puppet last run on ms-be2039 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:08:46] PROBLEM - puppet last run on graphite2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:08:46] PROBLEM - puppet last run on wtp2019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:08:47] PROBLEM - puppet last run on mw2155 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:08:56] PROBLEM - puppet last run on mc2029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:08:57] PROBLEM - puppet last run on elastic2011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:09:16] PROBLEM - puppet last run on mw2104 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:09:26] PROBLEM - puppet last run on mw2253 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:09:26] PROBLEM - puppet last run on mw2103 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:09:36] PROBLEM - puppet last run on mc2027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:09:36] PROBLEM - puppet last run on elastic2019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:10:07] is that normal? [17:12:21] no [17:14:06] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:14:09] (03PS4) 10Andrew Bogott: Labtest: Allow the new labtestpuppetmaster to reach mysql on labtestcontrol [puppet] - 10https://gerrit.wikimedia.org/r/365056 [17:18:10] (03PS1) 10RobH: labweb100[12] production dns entries [dns] - 10https://gerrit.wikimedia.org/r/365063 [17:20:46] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [17:21:49] (03CR) 10RobH: [C: 032] labweb100[12] production dns entries [dns] - 10https://gerrit.wikimedia.org/r/365063 (owner: 10RobH) [17:28:19] (03PS3) 10Umherirrender: Add ar_content_format and ar_content_model to labs views [puppet] - 10https://gerrit.wikimedia.org/r/363851 (https://phabricator.wikimedia.org/T89741) [17:28:47] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [17:29:26] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:30:26] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:30:37] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:30:46] herron, re the listadmin issues - can you see security tasks? [17:30:49] (on phabricator) [17:31:30] hmm, do you have an example so I can double check? [17:31:34] herron, https://phabricator.wikimedia.org/T170601 [17:31:51] that's a task for these actual issues, so if not I can add you. 
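The alert flood above follows a fixed shape (`PROBLEM - puppet last run on <host> is CRITICAL` / `RECOVERY - puppet last run on <host> is OK`), so when triaging a storm like this it can help to reduce the log to the latest state per host. A minimal sketch in plain Python (not a Wikimedia tool; the helper name is made up):

```python
import re

# Matches both the PROBLEM and RECOVERY variants of the icinga bot lines above.
ALERT_RE = re.compile(
    r"(?P<kind>PROBLEM|RECOVERY) - puppet last run on (?P<host>\S+) is"
)

def still_failing(lines):
    """Return hosts whose most recent puppet alert is a PROBLEM (latest state wins)."""
    state = {}
    for line in lines:
        m = ALERT_RE.search(line)
        if m:
            state[m.group("host")] = m.group("kind")
    return sorted(h for h, k in state.items() if k == "PROBLEM")

log = [
    "[17:03:26] PROBLEM - puppet last run on cp4001 is CRITICAL: Catalog fetch fail.",
    "[17:04:16] PROBLEM - puppet last run on mw2122 is CRITICAL: Catalog fetch fail.",
    "[17:32:46] RECOVERY - puppet last run on cp4001 is OK: 0 failures",
]
print(still_failing(log))  # → ['mw2122']
```

cp4001 recovered after its PROBLEM, so only the host with no matching RECOVERY remains.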
[17:31:51] looks like no [17:31:56] RECOVERY - puppet last run on ms-be2022 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [17:32:26] RECOVERY - puppet last run on db2092 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [17:32:36] herron, ok, added - should be able to see - there's some header info in there so flagged in case. [17:32:36] RECOVERY - puppet last run on restbase2007 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [17:32:36] RECOVERY - puppet last run on mw2152 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [17:32:44] thanks! can see it now [17:32:46] RECOVERY - puppet last run on pybal-test2003 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:32:46] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [17:32:46] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [17:32:47] RECOVERY - puppet last run on ms-be2035 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [17:32:56] RECOVERY - puppet last run on db2060 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [17:32:57] RECOVERY - puppet last run on planet2001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [17:33:05] 10Operations, 10DBA, 10Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3436693 (10Marostegui) Thanks @Cmjohnson I have restarted the host for its reinstallation. I will close this task when done. 
[17:33:06] RECOVERY - puppet last run on db2059 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:33:16] RECOVERY - puppet last run on db2023 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [17:33:17] herron, feel free to add whomever makes sense. Just wanted to make sure I wasn't leaking anything private :) [17:33:17] RECOVERY - puppet last run on db2067 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [17:33:26] RECOVERY - puppet last run on db2012 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [17:33:46] RECOVERY - puppet last run on cp2021 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [17:33:46] RECOVERY - puppet last run on mc2028 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [17:33:54] ok will do :) [17:33:56] RECOVERY - puppet last run on mw2153 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [17:34:16] RECOVERY - puppet last run on db2083 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [17:34:16] RECOVERY - puppet last run on graphite2001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:34:17] RECOVERY - puppet last run on wtp2009 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [17:34:26] RECOVERY - puppet last run on ms-be2015 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [17:34:26] RECOVERY - puppet last run on wezen is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [17:34:26] RECOVERY - puppet last run on wtp2020 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [17:34:36] RECOVERY - puppet last run on restbase2001 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [17:34:36] RECOVERY - puppet last run on mc2022 
is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [17:34:46] RECOVERY - puppet last run on ms-be2037 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:34:57] RECOVERY - puppet last run on db2029 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [17:35:06] RECOVERY - puppet last run on ms-be2031 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [17:35:06] RECOVERY - puppet last run on db2033 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:35:06] RECOVERY - puppet last run on db2017 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [17:35:16] RECOVERY - puppet last run on ms-be2019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:35:26] RECOVERY - puppet last run on lvs4002 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [17:35:26] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:35:36] RECOVERY - puppet last run on kubernetes2004 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [17:35:36] RECOVERY - puppet last run on db2080 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [17:35:37] RECOVERY - puppet last run on ganeti2006 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:35:37] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:35:45] (03PS1) 10MarcoAurelio: High density logos for es.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365066 (https://phabricator.wikimedia.org/T170604) [17:35:46] RECOVERY - puppet last run on rdb2005 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [17:35:46] RECOVERY - puppet last run on db2061 is OK: OK: Puppet is 
currently enabled, last run 52 seconds ago with 0 failures [17:35:46] RECOVERY - puppet last run on db2082 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [17:35:47] RECOVERY - puppet last run on es2012 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [17:35:56] RECOVERY - puppet last run on wtp2005 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:36:16] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [17:36:17] RECOVERY - puppet last run on poolcounter2001 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [17:36:20] !log restarting puppetmasters, staggered [17:36:26] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:36:26] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [17:36:27] RECOVERY - puppet last run on restbase2011 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [17:36:27] RECOVERY - puppet last run on rdb2002 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [17:36:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:34] (03CR) 10Andrew Bogott: [C: 032] Labtest: Allow the new labtestpuppetmaster to reach mysql on labtestcontrol [puppet] - 10https://gerrit.wikimedia.org/r/365056 (owner: 10Andrew Bogott) [17:36:36] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2014105 [17:36:37] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:36:46] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:36:56] RECOVERY - puppet last run on db2039 is OK: OK: Puppet is currently enabled, 
last run 33 seconds ago with 0 failures [17:36:56] RECOVERY - puppet last run on graphite2002 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [17:36:56] RECOVERY - puppet last run on wtp2019 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:37:06] RECOVERY - puppet last run on mw2155 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [17:37:06] RECOVERY - puppet last run on mc2029 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [17:37:46] RECOVERY - puppet last run on mc2027 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:37:56] RECOVERY - puppet last run on ms-be2039 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [17:38:03] !log restarting varnish-be on cp1049 (mailbox lag) [17:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:20] 10Operations, 10DC-Ops: Information missing from racktables - https://phabricator.wikimedia.org/T150651#3436718 (10RobH) All of the cp40XX are fixed. 
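The cp1049 check above reports a raw counter ("expiry mailbox lag is 2014105", then "0" after the varnish-be restart). Extracting that number from a check line and classifying it against a threshold can be sketched as follows (plain Python; the threshold value here is illustrative, not the production alerting config):

```python
import re

# Pulls the numeric lag out of lines like "CRITICAL: expiry mailbox lag is 2014105".
LAG_RE = re.compile(r"expiry mailbox lag is (?P<lag>\d+)")

def mailbox_lag_status(check_output, threshold=1_000_000):
    """Classify a Varnish expiry mailbox lag check line.

    threshold is an assumed example value, not the real icinga setting.
    """
    m = LAG_RE.search(check_output)
    if not m:
        return "UNKNOWN"
    lag = int(m.group("lag"))
    return "CRITICAL" if lag > threshold else "OK"

print(mailbox_lag_status("CRITICAL: expiry mailbox lag is 2014105"))  # → CRITICAL
print(mailbox_lag_status("OK: expiry mailbox lag is 0"))              # → OK
```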
[17:41:26] RECOVERY - MariaDB Slave Lag: s1 on db1047 is OK: OK slave_sql_lag Replication lag: 0.06 seconds [17:42:16] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [17:46:33] !log re-enabling puppet and force run on 'R:Package = nginx-common' [17:46:36] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 0 [17:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:36] RECOVERY - puppet last run on elastic2027 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [17:48:36] RECOVERY - puppet last run on mw2180 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [17:48:56] RECOVERY - puppet last run on elastic2007 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:49:06] RECOVERY - puppet last run on mw2195 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [17:49:16] RECOVERY - puppet last run on elastic2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:49:16] RECOVERY - puppet last run on elastic2015 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [17:49:16] RECOVERY - puppet last run on mw2251 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [17:49:16] RECOVERY - puppet last run on mw2189 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [17:49:26] RECOVERY - puppet last run on mw2203 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [17:49:36] RECOVERY - puppet last run on mw2104 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [17:49:36] RECOVERY - puppet last run on elastic2028 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:49:37] RECOVERY - puppet last run on mw2245 is OK: OK: Puppet 
is currently enabled, last run 28 seconds ago with 0 failures [17:49:46] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [17:49:56] RECOVERY - puppet last run on mw2172 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [17:49:56] RECOVERY - puppet last run on elastic2019 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [17:50:06] RECOVERY - puppet last run on cp2005 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [17:50:17] RECOVERY - puppet last run on mw2227 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [17:50:17] RECOVERY - puppet last run on mw2237 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [17:50:26] RECOVERY - puppet last run on elastic2022 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [17:50:27] RECOVERY - puppet last run on elastic2030 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [17:50:36] RECOVERY - puppet last run on mw2244 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [17:50:37] RECOVERY - puppet last run on mw2101 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [17:50:46] RECOVERY - puppet last run on mw2103 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [17:50:46] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [17:50:57] RECOVERY - puppet last run on mw2199 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [17:51:06] RECOVERY - puppet last run on mw2194 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [17:51:06] RECOVERY - puppet last run on mw2187 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures 
[17:51:16] RECOVERY - puppet last run on elastic2033 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [17:51:36] RECOVERY - puppet last run on mw2122 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [17:51:46] RECOVERY - puppet last run on mw2163 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [17:51:46] RECOVERY - puppet last run on elastic2009 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [17:51:56] RECOVERY - puppet last run on elastic2005 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [17:51:57] RECOVERY - puppet last run on mw2097 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [17:51:57] RECOVERY - puppet last run on mw2120 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [17:52:06] RECOVERY - puppet last run on mw2241 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [17:52:06] RECOVERY - puppet last run on mw2211 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [17:52:16] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:52:16] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [17:52:17] RECOVERY - puppet last run on elastic2011 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [17:52:46] RECOVERY - puppet last run on mw2253 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [17:52:46] RECOVERY - puppet last run on mw2125 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [17:53:06] RECOVERY - puppet last run on elastic2023 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [17:54:27] (03PS4) 10Rush: Add ar_content_format and 
ar_content_model to labs views [puppet] - 10https://gerrit.wikimedia.org/r/363851 (https://phabricator.wikimedia.org/T89741) (owner: 10Umherirrender) [17:56:56] 10Operations, 10Commons: ERR_RESPONSE_HEADERS_MULTIPLE_CONTENT_DISPOSITION - https://phabricator.wikimedia.org/T170605#3436873 (10Framawiki) [17:58:16] 10Operations, 10Commons: ERR_RESPONSE_HEADERS_MULTIPLE_CONTENT_DISPOSITION - https://phabricator.wikimedia.org/T170605#3436479 (10Framawiki) @MaxSem `purgeList.php` ? [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170713T1800). Please do the needful. [18:00:04] Jdlrobson and RainbowSprinkles: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:01:50] Yay, MinervaNeue! [18:03:37] (03CR) 10Urbanecm: [C: 031] "LGTM, thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365066 (https://phabricator.wikimedia.org/T170604) (owner: 10MarcoAurelio) [18:04:22] Urbanecm: as I run on Windows, I couldn't wget, so I simply downloaded the 225x and 300x files from the Commons URL; hope that's fine [18:04:55] TabbyCat: There should be no difference between wget and save [18:05:04] good :) [18:05:06] BTW I've left you a note at #wikimedia-dev [18:05:20] looks [18:08:04] 10Operations, 10ops-codfw, 10Parsoid, 10Patch-For-Review, 10Services (watching): wtp2019 - hardware (RAM) check - https://phabricator.wikimedia.org/T146113#2651101 (10Arlolra) About to deploy but that patch is merged and wtp2019 is still pooled, ``` {"wtp2019.codfw.wmnet": {"pooled": "yes", "weight": 10...
[18:08:59] 10Operations, 10ops-ulsfo, 10DC-Ops: determine model/serial info for kvm-ulsfo - https://phabricator.wikimedia.org/T170613#3436971 (10RobH) [18:11:30] !log arlolra@tin Started deploy [parsoid/deploy@d0041f2]: Updating Parsoid to 71c07681 [18:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:59] For today's SWATter, I have something to deploy. For some reason I wasn't pinged by jouncebot, so I'm just making sure you know about me [18:22:41] !log arlolra@tin Finished deploy [parsoid/deploy@d0041f2]: Updating Parsoid to 71c07681 (duration: 11m 12s) [18:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:41] (03PS9) 10Dzahn: icinga/role:mail::mx: add monitoring of exim queue size [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) [18:24:16] (03CR) 10Dzahn: "changed the sudo line to have a wildcard. "exipick*"" [puppet] - 10https://gerrit.wikimedia.org/r/361023 (https://phabricator.wikimedia.org/T133110) (owner: 10Dzahn) [18:28:42] Dereckson: why are you a bot on wikitech? [18:29:15] 10Operations, 10Commons: ERR_RESPONSE_HEADERS_MULTIPLE_CONTENT_DISPOSITION - https://phabricator.wikimedia.org/T170605#3437054 (10MaxSem) Instead of purging blindly, I'd rather investigate. And not that it would help, considering that I inspected the headers coming from all 4 datacenters, no trace of duplicate...
[18:29:20] !log upgrading nginx on +wmf1 hosts: conf[1001-1003].eqiad.wmnet,cp1048.eqiad.wmnet,cp3036.esams.wmnet,elastic2020.codfw.wmnet,hassaleh.codfw.wmnet,hassium.eqiad.wmnet [18:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:37] (03CR) 10Dzahn: "'good that we also add the monitoring' was my first thought, because i remember from years ago when running a similar stack that it was ki" [puppet] - 10https://gerrit.wikimedia.org/r/364827 (https://phabricator.wikimedia.org/T170462) (owner: 10Herron) [18:30:52] bawolff: maybe not 'bot' but 'boss', check the group ;) [18:31:08] !log Updated Parsoid to 71c07681 (T169293) [18:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:18] T169293: While using ParsoidBatchAPI, [[Media: ]] does not link to media file anymore - https://phabricator.wikimedia.org/T169293 [18:31:22] (03PS6) 10Dzahn: Gerrit: Add gerrit pub key for ssh [labs/private] - 10https://gerrit.wikimedia.org/r/363755 (owner: 10Paladox) [18:31:32] TabbyCat: ? [18:31:48] [20:28] bawolff Dereckson: why are you a bot on wikitech ? [18:31:58] I'm checking the logs and found nothing though [18:32:12] That's part of why its mysterious [18:32:18] (03CR) 10Dzahn: [C: 032] "per IRC talk - yea...i didn't intend to make you change all of scap for just gerrit" [labs/private] - 10https://gerrit.wikimedia.org/r/363755 (owner: 10Paladox) [18:32:41] bawolff: granted from DB? [18:32:52] That, or someone suppressed the log entries [18:33:07] Let me check, I'm an oversighter there [18:33:26] (03CR) 10Dzahn: [V: 032 C: 032] Gerrit: Add gerrit pub key for ssh [labs/private] - 10https://gerrit.wikimedia.org/r/363755 (owner: 10Paladox) [18:33:32] but I never suppressed anything but harassing usernames [18:34:14] I think db would be more likely, because why would anyone suppress that? 
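The bot-flag puzzle above (granted directly in the database, or via a logged rights change?) can at least be checked for current state from the API: MediaWiki's `list=users` query with `usprop=groups` returns a user's present group membership. A minimal sketch (Python, URL construction only, no request sent; the helper name is made up):

```python
from urllib.parse import urlencode

def user_groups_url(api="https://wikitech.wikimedia.org/w/api.php", user="Dereckson"):
    """Build a MediaWiki API URL that lists a user's current groups.

    Hypothetical helper for illustration; list=users / usprop=groups are
    standard MediaWiki API parameters.
    """
    params = {
        "action": "query",
        "list": "users",
        "ususers": user,
        "usprop": "groups",
        "format": "json",
    }
    return api + "?" + urlencode(params)

print(user_groups_url())
```

This shows whether the flag is set now; when it was set still needs the (apparently missing) log entry.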
[18:34:47] (03PS1) 10BBlack: Revert "Revert "ssl_ciphersuite: limit ECDH curves where possible"" [puppet] - 10https://gerrit.wikimedia.org/r/365078 [18:34:56] (03PS2) 10BBlack: Revert "Revert "ssl_ciphersuite: limit ECDH curves where possible"" [puppet] - 10https://gerrit.wikimedia.org/r/365078 [18:35:30] (03CR) 10BBlack: [C: 04-1] "Still want to do this, but notebook100[12] and various labs hosts need nginx package upgrades first or they break on this." [puppet] - 10https://gerrit.wikimedia.org/r/365078 (owner: 10BBlack) [18:35:42] nothing in the logs [18:35:52] a bureaucrat could deflag him [18:36:08] 10Operations, 10ops-codfw, 10Parsoid, 10Patch-For-Review, 10Services (watching): wtp2019 - hardware (RAM) check - https://phabricator.wikimedia.org/T146113#3437064 (10Arlolra) Today's deploy is done. Note that before repooling, you'll need to update Parsoid to the latest commit on wtp2019. Thanks! [18:36:48] (03PS1) 10RobH: scs-oe11-esams had old dns name of scs1 [dns] - 10https://gerrit.wikimedia.org/r/365079 [18:37:39] Well for all I know he has a legit reason for being a secret bot :) [18:38:26] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2201.codfw.wmnet [18:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:01] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2202.codfw.wmnet [18:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:59] It's too bad botflag is only recorded in rc and not revision, or we'd be able to figure out when exactly this happened [18:40:05] 10Operations, 10ops-codfw: mw2201, mw2202 - contact Dell and replace main board - https://phabricator.wikimedia.org/T170307#3437074 (10Dzahn) Dell has told Papaul that it was delayed and they won't be there today. Repooled servers for now.
[18:45:19] https://wikitech.wikimedia.org/w/index.php?title=Special:DoubleRedirects&limit=500&offset=0 <-- O_o [18:45:23] * TabbyCat fixes [18:49:16] (03PS1) 10Eevans: WIP: Configure an additional data file directory [puppet] - 10https://gerrit.wikimedia.org/r/365081 [18:55:47] (03PS2) 10Eevans: Configure an additional data file directory [puppet] - 10https://gerrit.wikimedia.org/r/365081 [18:57:14] (03CR) 10Eevans: [C: 031] "This is safe/ready to merge; PC output: http://puppet-compiler.wmflabs.org/7058/" [puppet] - 10https://gerrit.wikimedia.org/r/365081 (owner: 10Eevans) [19:00:04] thcipriani: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170713T1900). Please do the needful. [19:00:15] * thcipriani acks [19:00:48] 10Operations, 10MW-1.30-release-notes, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3346442 (10Framawiki) >>! In T167714#3424838, @aude wrote: > I updated the sites table 2 weeks ago. interwiki links shou... 
[19:06:50] wikitech double and broken redirects fixed [19:07:44] (03PS27) 10Paladox: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) [19:08:04] (03PS1) 10Urbanecm: Provide HD logos for several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365084 (https://phabricator.wikimedia.org/T150618) [19:18:02] (03PS2) 10Urbanecm: Provide HD logos for several Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365084 (https://phabricator.wikimedia.org/T150618) [19:19:29] (03PS1) 10Urbanecm: Provide HD logos for several Wikiversities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365086 [19:20:00] (03PS2) 10Urbanecm: Provide HD logos for several Wikiversities [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365086 (https://phabricator.wikimedia.org/T150618) [19:21:21] 10Operations, 10netops: pfw-codfw still logging to indium - https://phabricator.wikimedia.org/T170622#3437338 (10Jgreen) [19:23:49] (03PS1) 10Andrew Bogott: Set labs_puppet_master for labtestpuppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/365088 [19:26:57] (03CR) 10Andrew Bogott: [C: 032] Set labs_puppet_master for labtestpuppetmaster2001 [puppet] - 10https://gerrit.wikimedia.org/r/365088 (owner: 10Andrew Bogott) [19:49:08] (03PS1) 10Urbanecm: Add HD logos for several Wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365092 (https://phabricator.wikimedia.org/T150618) [19:49:40] jdlrobson: We got all our backports ready to go? 
[19:49:51] And I guess we need the wmf-config swap too (we could start on testwiki to be safe) [19:50:01] (03CR) 10jerkins-bot: [V: 04-1] Add HD logos for several Wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365092 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [19:51:10] (03PS1) 10Chad: Move testwiki to MinervaNeue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365093 [19:55:31] (03PS2) 10Urbanecm: Add HD logos for several Wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365092 (https://phabricator.wikimedia.org/T150618) [19:58:38] 10Puppet, 10Cloud-VPS: Invesitgate use of Puppet "environments" for per-project Puppet manifests - https://phabricator.wikimedia.org/T170370#3437479 (10bd808) [19:59:16] !log smalyshev@tin Started deploy [wdqs/wdqs@a32dbeb]: Redeploy GUI due to breakage in T165228 [19:59:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:29] T165228: Query results are downloaded in wrong encoding - https://phabricator.wikimedia.org/T165228 [19:59:39] (03CR) 10Krinkle: "It may be obvious but can you briefly describe for the record where these came from? E.g. what logic or process did you use? 
A particular " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365092 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [20:01:35] !log smalyshev@tin Finished deploy [wdqs/wdqs@a32dbeb]: Redeploy GUI due to breakage in T165228 (duration: 02m 19s) [20:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:07] (03PS1) 10Ottomata: Remove base::firewall from stat private boxes [puppet] - 10https://gerrit.wikimedia.org/r/365095 (https://phabricator.wikimedia.org/T170496) [20:03:30] (03CR) 10Ottomata: [V: 032 C: 032] Remove base::firewall from stat private boxes [puppet] - 10https://gerrit.wikimedia.org/r/365095 (https://phabricator.wikimedia.org/T170496) (owner: 10Ottomata) [20:05:00] 10Operations, 10Traffic, 10media-storage: HTTP 429 on image on Commons - https://phabricator.wikimedia.org/T170628#3437497 (10Superzerocool) [20:06:00] (03PS6) 10D3r1ck01: Remove 'din' from wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362876 (https://phabricator.wikimedia.org/T168523) [20:06:27] 10Operations, 10Traffic, 10media-storage: HTTP 429 on image on Commons - https://phabricator.wikimedia.org/T170628#3437515 (10Superzerocool) [20:07:10] (03PS1) 10Ottomata: ensure => absent for base::firewall in statistics::private profile [puppet] - 10https://gerrit.wikimedia.org/r/365096 (https://phabricator.wikimedia.org/T170496) [20:08:22] (03CR) 10Ottomata: [V: 032 C: 032] ensure => absent for base::firewall in statistics::private profile [puppet] - 10https://gerrit.wikimedia.org/r/365096 (https://phabricator.wikimedia.org/T170496) (owner: 10Ottomata) [20:11:36] PROBLEM - Check whether ferm is active by checking the default input chain on stat1005 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [20:12:06] PROBLEM - Check whether ferm is active by checking the default input chain on stat1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm 
might not have been started correctly [20:15:13] jdlrobson: About? We doing this? [20:19:04] (03CR) 10Urbanecm: "Hello, I've used simple Python script that looks for SVG file named File:Wiktionary-logo-.svg at commons. If there's one, it create " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365092 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [20:23:51] RainbowSprinkles: here [20:25:01] Ok let's do this. All the backports to wmf.7 and 9 ready? [20:25:07] We'll merge those, sync, then swap testwiki [20:25:10] And if good, swap the rest [20:26:45] Yeah did you see my note about the Jenkins failure? [20:27:34] The test started flaking this week and can be ignored [20:27:52] Oh, no I hadn't. Links? [20:29:28] RainbowSprinkles: so first thing.. did https://gerrit.wikimedia.org/r/#/c/362448/ get put in wmf7 and 9? [20:29:42] ottomata: ferm not happy on stats hosts ;) ^^^ [20:29:56] hmmm [20:30:09] jdlrobson: No, I couldn't backport it cleanly, thought you would do it [20:30:18] I got a bit lost (as it's a dependency of the ones you raised and needs to be merged). Browser tests will fail [20:30:23] So can't be done cleanly [20:30:56] So you were saying you couldn't cherry pick it to both branches? Hmmm [20:31:18] Gerrit complained about merge conflicts [20:31:55] Okay let me take a look.
[20:32:17] Cleanest way would be to cherry pick all the patches not in master [20:32:45] * RainbowSprinkles nods [20:32:54] This is your party, I'm just here to help :) [20:33:03] Get the two branches how you like them, then we'll proceed [20:34:02] https://gerrit.wikimedia.org/r/#/q/I7f004b43e11d88492b205a3584c29f72d26bad57 I have backports prepped and ready to +2 [20:34:10] RainbowSprinkles: yup let's start with https://gerrit.wikimedia.org/r/#/c/365108/ [20:34:29] That will make it possible to cherry pick the other one [20:34:39] +2'd [20:36:01] And wmf9 https://gerrit.wikimedia.org/r/#/c/365110/ [20:39:11] +2 [20:39:47] PROBLEM - High lag on wdqs1002 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1800.0] [20:41:01] (03PS1) 10Ottomata: Disable ferm check if base::firewall ensure => absent [puppet] - 10https://gerrit.wikimedia.org/r/365120 [20:41:13] volans: https://gerrit.wikimedia.org/r/#/c/365120/ [20:41:20] RainbowSprinkles: hey.. so second though [20:41:26] could we just do this in wmf9 [20:41:28] (03PS5) 10Dzahn: admin: Remove deployers from restricted group [puppet] - 10https://gerrit.wikimedia.org/r/364469 (https://phabricator.wikimedia.org/T104671) (owner: 10Dereckson) [20:41:29] e.g. skip enwiki for today? [20:41:41] or does it need to be done in both? [20:42:11] I was doing both for consistency in l10n cache. And with wmf.9 halted, could be ugly to differ between the two [20:42:22] RainbowSprinkles: ok fair enough [20:42:29] wmf9 ready to go: https://gerrit.wikimedia.org/r/365122 [20:42:59] wmf7 is just very very messy [20:43:06] Yeah, I feel ya [20:43:07] im not sure how to fix it [20:43:17] Hmm, we can just skip wmf.7 [20:43:18] That's fine [20:43:19] i18n scares me [20:43:34] We should be fine, we'll just skip group2 for enabling [20:43:47] okay great. That would make me much less scared [20:44:27] So what about https://gerrit.wikimedia.org/r/#/q/I7f004b43e11d88492b205a3584c29f72d26bad57? 
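(Aside: the backport pattern under discussion, a fix that exists only on master being cherry-picked onto a wmf release branch, can be sketched in a throwaway local repo. Everything below is made up for illustration except the branch-name convention; `git cherry-pick -x` appends the "(cherry picked from commit …)" trailer that also shows up on Gerrit-made backports. When the pick doesn't apply cleanly, the command stops on a conflict, which is what Gerrit was reporting here.)

```shell
# Throwaway demo repo; names are illustrative, only the workflow is real.
set -e
cd "$(mktemp -d)"
git init -q repo && cd repo
git config user.email demo@example.org
git config user.name demo

echo base > file.txt
git add file.txt && git commit -qm 'initial'
git branch wmf/1.30.0-wmf.9              # cut the "release" branch here

echo fix >> file.txt                     # fix lands on master only
git commit -qam 'Fix bug on master'
fix_sha=$(git rev-parse HEAD)

git checkout -q wmf/1.30.0-wmf.9
git cherry-pick -x "$fix_sha" >/dev/null # backport; -x records the source sha
git log -1 --format=%B | grep 'cherry picked'
```

"Cherry pick all the patches not in master" (below) is the same move repeated per patch, in dependency order, so each later pick applies cleanly.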
[20:45:24] yup that needs to land in 9 too [20:45:39] (03CR) 10Dzahn: [C: 032] admin: Remove deployers from restricted group [puppet] - 10https://gerrit.wikimedia.org/r/364469 (https://phabricator.wikimedia.org/T104671) (owner: 10Dereckson) [20:45:58] So, what happens if we sync MobileFrontend without enabling the skin? Does it fail ok still? [20:46:00] With the removal? [20:46:17] RainbowSprinkles: it will throw an exception when you hit the mobile site [20:46:26] so the skin will need to be there first [20:46:41] Can MobileFrontend load the skin without the skin extension enabled? [20:47:06] RainbowSprinkles: okay, so let's rewind [20:47:08] https://gerrit.wikimedia.org/r/365110 is harmless [20:47:12] no problems with merging that [20:47:17] (03PS2) 10Dzahn: fix IPv6 reverse record for kraz.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/364904 [20:47:43] RainbowSprinkles: if we then merge https://gerrit.wikimedia.org/r/365122 the mobile site will still function but some of the JS will break [20:47:53] (provided the skin is enabled) [20:48:11] So, long as we don't backport that stuff to wmf.7, should be ok? [20:48:32] https://gerrit.wikimedia.org/r/364921 will fix the JS problems [20:48:48] yes. So on wmf7 if we enable Minerva there [20:48:58] what should happen is Minerva will be ignored [20:49:04] lemme double check that though [20:49:55] Hmm. I'm getting a little apprehensive about this... [20:50:02] There's a *lot* of moving parts [20:50:10] (03PS3) 10Dzahn: fix IPv6 reverse record for kraz.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/364904 [20:50:55] wmf7 works fine with Minerva enabled [20:51:29] So, we just enable everywhere? 
I'm worried what happens if we roll out MobileFrontend to wmf.9 without enabling the skin [20:52:14] (03CR) 10Dzahn: [C: 032] fix IPv6 reverse record for kraz.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/364904 (owner: 10Dzahn) [20:53:15] (03PS4) 10Dzahn: fix IPv6 reverse record for kraz.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/364904 [20:53:19] RainbowSprinkles: so https://gerrit.wikimedia.org/r/365122 is responsible for delegating the skin to Minerva [20:53:30] until then Minerva will sit quietly [20:53:35] (03CR) 10Volans: [C: 04-1] "I don't think this is the right solution and that the problem is instead https://gerrit.wikimedia.org/r/#/c/365096/ itself:" [puppet] - 10https://gerrit.wikimedia.org/r/365120 (owner: 10Ottomata) [20:53:38] (03PS3) 10Urbanecm: Add HD logos for several Wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365092 (https://phabricator.wikimedia.org/T150618) [20:53:43] But, it's not enabled. Should we preemptively enable it everywhere? [20:53:47] And *then* sync the code? [20:53:49] RainbowSprinkles: yes. I think that would be wise [20:53:54] Ok, we'll do that. 
[20:54:19] (03CR) 10Chad: [C: 032] Move testwiki to MinervaNeue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365093 (owner: 10Chad) [20:54:24] Starting with testwiki [20:54:26] To make sure [20:55:36] (03Merged) 10jenkins-bot: Move testwiki to MinervaNeue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365093 (owner: 10Chad) [20:55:39] 10Operations, 10Cassandra, 10RESTBase-Cassandra, 10Patch-For-Review, 10Services (attic): Evaluate Brotli compression for Cassandra - https://phabricator.wikimedia.org/T125906#3437725 (10Eevans) 05Open>03declined [20:55:40] RainbowSprinkles: makes sense :) [20:55:51] 10Operations, 10Wikimedia-IRC-RC-Server, 10IPv6, 10Patch-For-Review: enable IPv6 on irc.wikimedia.org - https://phabricator.wikimedia.org/T105422#3437728 (10Dzahn) ``` [kraz:~] $ host kraz.wikimedia.org kraz.wikimedia.org has address 208.80.153.44 kraz.wikimedia.org has IPv6 address 2620:0:860:2:208:80:153... [20:56:06] (03CR) 10Dzahn: "[kraz:~] $ host kraz.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/364904 (owner: 10Dzahn) [20:56:12] brb grabbing a coffee having caffeine withdrawal headaches [20:56:30] I've already had 3 today [20:56:43] had 0 [20:56:44] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: MinervaNeue on testwiki (duration: 00m 47s) [20:56:52] can't cold turkey it [20:56:54] (03CR) 10Urbanecm: "I've removed HD logos which are different from non-HD logos for safety, will ask for SVG's later. I also standardized jawiktionary's logo " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365092 (https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [20:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:20] Ok, it's installed but I'm missing l10n [20:57:20] jdlrobson: I feel ya there. I was so addicted that if I didn't have caffeine by 9am I had a migraine the rest of the day (the severity was probably increased by other health issues). I quit cold turkey one week when I had the flu.
⧼minerva-skin-desc⧽ [20:57:49] RainbowSprinkles: where you seeing that? [20:57:53] Special:Version [20:57:56] On testwiki [20:58:04] Oh yeah, you renamed message keys right? [20:58:06] errgg the key is missing. let me add [20:58:06] That would explain it [20:58:13] Oh ok, easy fix [20:58:23] (03CR) 10jenkins-bot: Move testwiki to MinervaNeue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365093 (owner: 10Chad) [20:58:31] useskin=minerva seems to work on testwiki [20:58:33] (03CR) 10Volans: "-1 (after merge)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/365096 (https://phabricator.wikimedia.org/T170496) (owner: 10Ottomata) [20:58:39] Soooo, yay to that step? [20:59:01] volans: i've talked with moritz about this before [20:59:09] RainbowSprinkles: looking [20:59:14] 10Operations, 10Patch-For-Review: Rename 'restricted' group? - https://phabricator.wikimedia.org/T104671#3437738 (10Dzahn) the following users have been removed from the group: hoo khorn ssastry nuria legoktm addshore all of them were (and are) already in "deployers". [20:59:17] spark just picks any port, and will keep picking ports if there is a conflict [20:59:27] https://test.m.wikipedia.org/wiki/Special:Version?useskin=minervaneue < RainbowSprinkles also working [20:59:28] same for some other hadoop daemons too [20:59:32] since they are short lived [20:59:38] Oh dur [20:59:41] New name [20:59:56] ottomata: ok, but I guess they are connecting to something you know right?
[21:00:01] i accidentally applied base firewall to stat1002 and stat1005 in my refactoring [21:00:08] ya [21:00:10] in this case [21:00:15] a user logs into stat1002 [21:00:19] and launches [21:00:22] pyspark --master yarn [21:00:36] the local process will open up a port, and then tell the yarn job what that port is [21:00:37] jdlrobson: Let's fix that l10n issue, then I'll roll out the config change to all of group0 and group1 [21:00:43] then the yarn job will send results back to that port [21:00:47] it's a CLI repl [21:00:52] RainbowSprinkles: okay on the i18n issue [21:00:55] :( [21:01:02] Er, actually, all wikis. Cuz group2 still on wmf.7 [21:01:07] Where we don't have code to backport [21:01:21] ottomata: and they cannot be set into a range of ports? [21:01:38] usually those kinds of apps allow setting a range to use... [21:02:00] not that i know of, no. i don't think i've examined the spark code for this, but hadoop tasks throughout the cluster have a similar problem. there is a range specifiable for those, but there was some bug that kept it from working in whatever version we are using [21:02:05] it's been a while since I looked at that [21:02:52] RainbowSprinkles: https://gerrit.wikimedia.org/r/365162 Add missing Minerva skin description message key [21:02:56] anyway, if a decision was made to not have base::firewall, it's not that setting the class to absent will remove it, it needs many more manual steps [21:03:01] unless you want to reimage ;) [21:03:08] that's master.. but we'll need to swat to wmf.7 and 9 i guess? [21:03:24] Yeah I guess [21:03:25] haha, volans wonder why base::firewall even has an ensure param then?
[21:03:35] volans: setting ensure => absent solved my problem [21:03:44] but didn't remove the monitoring of the drop policy [21:03:49] not really, ferm is still running on the host [21:04:13] jdlrobson: Did all the cherry-picks and +2'd all around [21:04:24] base::firewall doesn't mean ensure ferm absent [21:04:32] it just means ensure base firewall absent [21:05:07] base::firewall has include ::ferm [21:05:25] so it's the one installing it [21:05:31] without that class you would not have ferm there at all [21:05:35] nor iptables rules [21:05:41] IIRC [21:05:45] volans: aye, but i'm ok with that [21:05:47] ;) [21:05:54] i just want it to not block the ports [21:05:59] i don't mind if ferm is there [21:06:07] RainbowSprinkles: sweet [21:06:17] what I'm saying is that you're having a bit of a Frankenstein [21:06:27] jdlrobson: This is taking far longer than either of us anticipated last week :D [21:06:31] you have an INPUT policy accept with a ton of rules ACCEPT that doesn't make sense to me [21:07:13] I think that if the final status should be: this machine has no firewall, it needs a proper cleanup [21:07:30] volans: or: base::firewall ensure => absent should do the right thing [21:07:34] or shouldn't exist at all [21:07:35] right? [21:07:49] $ensure param shouldn't exist [21:07:50] if it doesn't work [21:08:14] (03Draft3) 10MacFan4000: Update ExtensionDistributer versions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365137 [21:08:23] in theory, yes, in practice, give me one module that we have that properly does the uninstallation of everything ;) [21:08:25] RECOVERY - High lag on wdqs1002 is OK: OK: Less than 30.00% above the threshold [600.0] [21:08:27] volans: other things actually declare ferm on these nodes [21:09:00] haha, i've made them before, but more recently i've stopped trying. buuuuut if it doesn't work, we shouldn't provide it [21:09:10] so, we don't want to uninstall ferm [21:09:26] we just want what? to reset the conntrack stuff?
[21:09:55] no, because ferm sets iptables rules, stopping/uninstalling it IIRC will just leave the situation as it is [21:10:06] right, i don't want to uninstall it or stop it [21:10:10] and at next reboot you might get a different one ;) [21:10:13] other classes included on this node do declare ferm rules [21:10:21] like? [21:10:53] jdlrobson: So, pull the skin + scap. This will include the changes to hooks? That ok for wmf.9 now? [21:10:56] https://github.com/wikimedia/puppet/blob/06e3b643de7c2f27ecfe8606e7fee6fe6c5091e9/modules/role/manifests/analytics_cluster/rsyncd.pp#L21-L25 [21:10:57] (03PS1) 10Dzahn: add mapped IPv6 address for labtestervices200x [puppet] - 10https://gerrit.wikimedia.org/r/365171 [21:11:01] And when do we pull + push out MobileFrontend? [21:11:07] Do we need to swap config for rest of wikis first? [21:11:22] RainbowSprinkles: what's on test wiki now? [21:11:32] The skin's enabled on testwiki [21:11:37] But I haven't pulled anything else [21:12:03] ottomata: so you want something that was not contemplated in the module, allow for a host to have both rules and a policy of ACCEPT [21:12:05] RainbowSprinkles: so we're enabling on all wikis or just wmf9? [21:12:21] and then scap? [21:12:23] I figured enable on all, since you said it falls back on wmf.7 fine [21:12:28] yep [21:12:41] IMHO if this is a case we want to support, the module should be modified to support this use case [21:12:51] deep breath.. [21:12:53] let's do it! [21:12:55] so having the check that checks the ACCEPT policy instead of DROP [21:13:00] and not setting ACCEPT rules if the policy is ACCEPT [21:13:02] ahhhh i see what you are saying.
ummmmmm [21:13:20] because they are useless anyway :D [21:13:32] and if you want it to be ACCEPT [21:13:33] better to check it [21:13:46] (03PS1) 10Chad: Enable MinervaNeue everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365172 [21:13:46] otherwise things will break I guess [21:13:48] (03CR) 10Chad: [C: 032] Enable MinervaNeue everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365172 (owner: 10Chad) [21:14:48] ottomata: sorry I really gotta go now, but I think you got my point [21:15:44] (03Merged) 10jenkins-bot: Enable MinervaNeue everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365172 (owner: 10Chad) [21:15:49] (03PS1) 10RobH: labweb100[12] install params [puppet] - 10https://gerrit.wikimedia.org/r/365173 [21:16:09] jdlrobson: Pulling to mwdebug1001 [21:16:33] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labweb100[12].wikimedia.org - https://phabricator.wikimedia.org/T167820#3437793 (10RobH) [21:16:44] RainbowSprinkles: fingers crossed. It's scary how long i've dreamed of this day. I remember asking csteipp about security reviews... so long ago! [21:16:51] (03CR) 10RobH: [C: 032] labweb100[12] install params [puppet] - 10https://gerrit.wikimedia.org/r/365173 (owner: 10RobH) [21:17:19] ottomata: but I wouldn't make all those changes to the ferm module without asking mor.itz if he's ok with that ;) [21:17:45] 10Operations, 10Traffic, 10media-storage: HTTP 429 on image on Commons - https://phabricator.wikimedia.org/T170628#3437497 (10Zoranzoki21) I too. {F8738207} {F8738208} {F8738209} [21:18:04] (03PS2) 10Ottomata: Check that default input policy is ACCEPT if base::firewall ensure => absent [puppet] - 10https://gerrit.wikimedia.org/r/365120 [21:18:04] until tomorrow maybe the best thing is just to ack the alerts with a link to the task or CR? 
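(The flapping "Check whether ferm is active" alerts above boil down to verifying that the iptables INPUT chain's default policy is DROP. Below is a toy sketch of that logic run against canned `iptables -S`-style text; the real check needs root and the live ruleset, and the icinga plugin's internals are assumed here, not copied.)

```shell
# check_input_policy: OK if the ruleset sets INPUT's default policy to
# DROP (what ferm configures), CRITICAL otherwise. Input is canned text
# standing in for `iptables -S INPUT` output.
check_input_policy() {
  if printf '%s\n' "$1" | grep -q '^-P INPUT DROP'; then
    echo 'OK - default input drop policy set'
  else
    echo 'CRITICAL - ferm input drop default policy not set'
  fi
}

with_ferm='-P INPUT DROP
-A INPUT -p tcp --dport 22 -j ACCEPT'
without_ferm='-P INPUT ACCEPT
-A INPUT -p tcp --dport 22 -j ACCEPT'

check_input_policy "$with_ferm"      # prints the OK line
check_input_policy "$without_ferm"   # prints the CRITICAL line
```

volans's "Frankenstein" point is the second case: with an ACCEPT default policy, the individual ACCEPT rules are redundant, so either the check should expect ACCEPT on such hosts or the rules shouldn't be rendered at all.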
[21:18:06] jdlrobson: enwp seems fine on mwdebug1001, but please confirm [21:18:07] aye volans, i'll add him as reviewer to ^ [21:18:21] great, I'll not be around tomorrow [21:18:25] just FYI ;) [21:18:45] me neither :) [21:18:54] lol [21:18:58] RainbowSprinkles: assuming the skin-desc issue is due to not running scap yet? [21:19:02] have a good weekend then [21:19:03] Yeah [21:19:08] That'll get fixed when we cap [21:19:09] *scap [21:19:14] RainbowSprinkles: seeing some css issues on https://en.m.wikipedia.org/w/index.php?title=Special:MobileOptions&returnto=Special%3AVersion [21:19:24] Ew [21:19:27] RainbowSprinkles: is that scap related? [21:19:31] (03CR) 10jerkins-bot: [V: 04-1] Check that default input policy is ACCEPT if base::firewall ensure => absent [puppet] - 10https://gerrit.wikimedia.org/r/365120 (owner: 10Ottomata) [21:19:35] Umm, not that I know of? [21:19:45] (03CR) 10Ottomata: "Moritz, I accidentally applied base::firewall to stat1002 and stat1005 while refactoring stuff in puppet. This broke Spark shells there. " [puppet] - 10https://gerrit.wikimedia.org/r/365120 (owner: 10Ottomata) [21:19:54] mwdebug shouldn't be cached [21:19:59] RainbowSprinkles: have no idea why that would happen [21:20:26] something very wrong [21:20:31] ACKNOWLEDGEMENT - Check whether ferm is active by checking the default input chain on stat1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly ottomata See: https://gerrit.wikimedia.org/r/#/c/365120 [21:20:31] ACKNOWLEDGEMENT - Check whether ferm is active by checking the default input chain on stat1005 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly ottomata See: https://gerrit.wikimedia.org/r/#/c/365120 [21:20:34] RECOVERY - High lag on wdqs2003 is OK: OK: Less than 30.00% above the threshold [600.0] [21:20:41] laters, thanks volans :) [21:20:45] jdlrobson: I'm inclined to abort this today. 
It's already taking far longer than we anticipated. [21:20:53] yeh this doesn't look good :/ [21:21:01] i just wish i knew why [21:21:04] RECOVERY - WDQS SPARQL on wdqs2002 is OK: HTTP OK: HTTP/1.1 200 OK - 13048 bytes in 0.073 second response time [21:21:05] RECOVERY - WDQS SPARQL on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 13048 bytes in 0.073 second response time [21:21:09] (03PS1) 10Andrew Bogott: labspuppetbackend: Include some more python dependencies for jessie. [puppet] - 10https://gerrit.wikimedia.org/r/365174 [21:21:15] RECOVERY - WDQS HTTP on wdqs2002 is OK: HTTP OK: HTTP/1.1 200 OK - 13048 bytes in 0.073 second response time [21:21:17] jdlrobson: Who on your team does deploys? I'm curious if that'd make things a little easier too :) [21:21:17] robh: could you re-pool wdq2002 and 2003? They are reloaded now. And de-pool 2001 - I'll reload it next [21:21:22] yw, sorry for "blocking" you :) [21:21:24] (I'm poking blind here) [21:21:34] RECOVERY - WDQS HTTP on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 13048 bytes in 0.073 second response time [21:21:45] Soooo, we want to roll back wmf.9 submodules too [21:21:45] RainbowSprinkles: we don't really have one. Sam can do them but seldom does (and he's not on my timezone) [21:22:10] Anyway, let's roll back my config changes [21:22:13] And we'll sort out wmf.9 [21:22:30] (03PS1) 10Chad: Revert "Enable MinervaNeue everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365175 [21:22:35] (03CR) 10Chad: [V: 032 C: 032] Revert "Enable MinervaNeue everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365175 (owner: 10Chad) [21:22:58] RainbowSprinkles: will it freak you out if I break stashbot for a while? [21:23:06] I need to do the elasticsearch cluster upgrade in toolforge [21:23:16] and that will knock the bot out for a while [21:23:19] stashbot? I barely pay attention [21:23:19] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. 
[21:23:20] RainbowSprinkles: any idea how i might be able to replicate that again? [21:23:23] i've never seen that issue [21:23:27] I left it enabled on testwiki [21:23:28] so i'd like to debug [21:23:45] RainbowSprinkles: but test wiki dont have that issue :/ [21:23:47] RainbowSprinkles: it does the logging to SAL, though :) [21:23:49] cool. I'll keep track of !logs it misses [21:23:56] well, in that ^ case :) [21:24:01] Oh, well I'm done sync'ing for now [21:24:09] (03CR) 10jerkins-bot: [V: 04-1] labspuppetbackend: Include some more python dependencies for jessie. [puppet] - 10https://gerrit.wikimedia.org/r/365174 (owner: 10Andrew Bogott) [21:24:18] jdlrobson: I dunno what else to say, but I really don't want to proceed with rolling it out to prod wikis [21:24:22] Right now [21:24:27] RainbowSprinkles: yeh i dont blame you :) [21:24:28] I think we're pretty safe where I left things though [21:24:34] RainbowSprinkles: i just want to work out how i can help this for next time [21:24:49] and since Minerva is now no longer in the master of MobileFrontend [21:24:51] gotta fix that quick [21:25:48] on plus side issue was in wmf7 right? [21:26:18] I imagine that it's because the resources are being registered twice.. once by MobileFrontend and once by Minerva [21:26:30] behaves differently on my local machine though [21:28:08] looks like robh is not around... Anybody can help me re-pool wdq2002 and 2003 on pybal? [21:28:24] im about [21:28:30] i can do =] [21:28:32] robh: oh cool :) [21:28:47] !log robh@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wdqs2003.codfw.wmnet [21:28:51] robh: so please repool 2002 and 2003 and depool 2001, I'll be reloading it next [21:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:56] !log robh@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wdqs2002.codfw.wmnet [21:28:59] thanks! 
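(The repool/depool sequence robh runs here, spelled out as conftool CLI calls. The selector/action pairs are taken verbatim from the !log lines; the `confctl` invocation syntax is an assumption inferred from those logs, and the commands only mean anything against Wikimedia's conftool/etcd setup, so treat this as illustration rather than something runnable elsewhere.)

```
# repool the reloaded hosts (mirrors the !log entries above)
confctl select 'name=wdqs2002.codfw.wmnet' set/pooled=yes
confctl select 'name=wdqs2003.codfw.wmnet' set/pooled=yes
# watch pybal send traffic to them, then take 2001 out for its reload
confctl select 'name=wdqs2001.codfw.wmnet' set/pooled=no
```

Watching pybal between the repool and the depool is the point of the pause: it confirms the pool has healthy capacity again before another host is removed.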
[21:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:08] lemme watch pybal for a bit [21:29:10] before i depool 1001 [21:29:14] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3437807 (10Dzahn) a:05mmodell>03RobH @robh @mmodell It seems this system isn't actually up and running but has an issue. I noticed that i could not ssh to it,... [21:29:15] i wanna see the others take traffic [21:29:17] ok [21:30:12] RainbowSprinkles: oh shit. [21:30:21] Getting an exception on mobile site of https://m.mediawiki.org/wiki/Reading/Web [21:30:24] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3437811 (10mmodell) @Dzahn hmm, I thought it was working before? I think I even logged into it but maybe I am remembering incorrectly. [21:30:25] and https://en.m.wikivoyage.org/wiki/Main_Page [21:31:18] Gah [21:31:19] SMalyshev: ok, it's been repooled; if it was failing a check pybal would have output it to the pybal log [21:31:20] RainbowSprinkles: oh i had mwdebug1001 on.. so that's not production. phew. [21:31:26] Error: /Main_Page RuntimeException from line 60 of /srv/mediawiki/php-1.30.0-wmf.9/extensions/MobileFrontend/includes/MobileFrontend.hooks.php: wgMFDefaultSkinClass is not setup correctly. It should point to the class name of a [21:31:34] !log robh@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs2001.codfw.wmnet [21:31:36] That happens when MobileFrontend doesn't have Minerva and Minerva is not enabled [21:31:42] SMalyshev: so you are all set [21:31:54] SMalyshev: as long as you see things happening on wdqs200[23] [21:31:59] Soooo, we're stuck.
If we enable it, we break CSS [21:32:02] i just was watching lvs traffic not the actual nodes [21:32:04] If we keep it disabled, we get exception [21:32:15] RainbowSprinkles: i have a theory of how it might be fixed [21:32:21] Can we please just roll the submodules for wmf.9 back to a known-good place? [21:32:25] I don't want to keep debugging this [21:32:54] (03CR) 10jenkins-bot: Enable MinervaNeue everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365172 (owner: 10Chad) [21:32:55] (03CR) 10jenkins-bot: Revert "Enable MinervaNeue everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365175 (owner: 10Chad) [21:33:00] 10Operations, 10Phabricator, 10Release-Engineering-Team (Kanban): setup/install phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T163938#3437812 (10RobH) busy box is a failed hdd spin up during post, I'd try rebooting and I bet it comes back. [21:33:00] we're done debugging, as chad said, go back, there is no forward here [21:34:42] !log demon@tin Locking from deployment [operations/mediawiki-config]: Nobody use this (planned duration: 60m 00s) [21:35:05] thcipriani: Hey cool, I think that's the first time we've seen that logged ^ :) [21:35:35] RECOVERY - High lag on wdqs2002 is OK: OK: Less than 30.00% above the threshold [600.0] [21:35:52] nice :) [21:36:05] !log demon@tin Unlocked for deployment [operations/mediawiki-config]: Nobody use this (duration: 01m 23s) [21:36:14] Anyway, no need to lock it really this minute [21:37:58] well, icinga recoveries of wdqs is good [21:38:07] RainbowSprinkles: if by meaning roll the submodules for wmf.9 back to a known-good place you mean fix MobileFrontend wmf.9 branch all we need to do is revert https://gerrit.wikimedia.org/r/#/c/365122/ [21:38:21] Let's [21:38:23] Please [21:38:31] I wasnt saying debug this now, i was saying i think i know why it's happening (as in we can fix this for next deploy) [21:39:25] RainbowSprinkles: https://gerrit.wikimedia.org/r/365176 [21:39:44] 
thcipriani: That's the one ^ [21:39:52] jdlrobson: He's gonna help you for awhile, I gotta run a few errands [21:39:53] jdlrobson: tag team, RainbowSprinkles just tagged me in :) [21:40:09] thcipriani: ^ that should stop all the fatals for mobile traffic [21:41:54] jdlrobson: okie doke [21:42:13] but when this is all done we'll have to work out a plan for wmf-next-branch to avoid going through this again :) [21:43:47] I'm catching up -- so this is all deployed on mwdebug1001 and nowhere else, currently? [21:43:47] robh: thanks! [21:43:56] welcome [21:43:57] =] [21:45:25] thcipriani: mwdebug1001 [21:45:29] yep as far as i know [21:45:38] logstash is only reporting issues there [21:50:11] (03PS2) 10Andrew Bogott: labspuppetbackend: Include some more python dependencies for jessie. [puppet] - 10https://gerrit.wikimedia.org/r/365174 [21:53:03] 10Operations, 10Fundraising-Backlog, 10fundraising-tech-ops: reports.frdev.wm.o -- still in use? - https://phabricator.wikimedia.org/T170640#3437842 (10Dzahn) [21:53:18] jdlrobson: ok, I +2'd that revert, looks like no l10nupdates have been run just yet, so merging and pulling over to mwdebug1001 may be all that's needed to get us back to where this started... does that sound right to you? [21:53:44] thcipriani: i would expect so [21:54:26] jdlrobson: do we need to revert: https://gerrit.wikimedia.org/r/#/c/365093/ [21:54:43] thcipriani: shouldn't think so [21:55:17] it's wmf9 [21:55:20] issue only impacts wmf7 [21:57:45] okie doke [22:03:31] thcipriani: lemme know when that lands [22:04:18] jdlrobson: just pulled it over on mwdebug1001 [22:04:26] !log Stashbot working after backend ElasticSearch cluster upgrade [22:04:29] thcipriani: fatals are gone w00t [22:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:49] jdlrobson: awesome :) [22:05:02] thcipriani: looks good..
i'll spend rest of afternoon debugging the issue I had with RainbowSprinkles earlier [22:05:17] I'll backfill the missing !logs as soon as I get the sal tool fixed [22:06:08] jdlrobson: sounds good, anything else need to be reverted afayk or are we back to where we started? [22:06:18] thcipriani: going to try to push the train before swat? [22:06:26] greg-g: yes [22:06:30] * greg-g nods [22:06:46] (everybody get out and push) [22:07:32] 10Operations, 10Commons, 10Traffic, 10media-storage: HTTP 429 on image on Commons - https://phabricator.wikimedia.org/T170628#3437878 (10Framawiki) Possibly related to {T170605}, the error is different [22:07:49] 10Operations, 10Commons, 10Traffic, 10media-storage: ERR_RESPONSE_HEADERS_MULTIPLE_CONTENT_DISPOSITION - https://phabricator.wikimedia.org/T170605#3436479 (10Framawiki) [22:08:03] 10Operations, 10Commons, 10Traffic, 10media-storage: HTTP 429 on image on Commons - https://phabricator.wikimedia.org/T170628#3437497 (10BBlack) So, error code 429 is `Too Many Requests`, generally used by ratelimiters. In this case, it seems that thumbor (our internal service that renders thumbnails of i... [22:10:04] jdlrobson: what about this one: https://gerrit.wikimedia.org/r/#/c/364921/1 ? [22:14:57] thcipriani: looking [22:16:30] thcipriani: if i pull wmf9 is that up to date? [22:17:03] jdlrobson: yup should be what is deployed on at least mwdebug1001 [22:17:58] jdlrobson: also curious if https://gerrit.wikimedia.org/r/#/c/365110/ should be reverted although that one looks more innocuous [22:19:29] 10Operations, 10Wikimedia-Site-requests, 10Patch-For-Review, 10User-Urbanecm: Create fishbowl wiki for Maithili Wikimedians User Group - https://phabricator.wikimedia.org/T168782#3437940 (10Urbanecm) @Dereckson Please run createAndPromote.php because the author of this ticket should be the first bureaucrat. [22:19:37] thcipriani: debugging.. 
[22:20:22] mwdebug1001 looks fine [22:20:38] so i dont think any more reverts are needed but im double checking [22:20:51] thanks :) [22:20:57] thcipriani: MinervaNeue is not enabled [22:21:00] so yeh all is good [22:21:54] is it enabled on testwiki? [22:22:53] (03PS3) 10Andrew Bogott: labspuppetbackend: Include some more python dependencies for jessie. [puppet] - 10https://gerrit.wikimedia.org/r/365174 [22:23:55] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 2 down 1 [22:24:27] thcipriani: testwiki is fine [22:24:45] (03CR) 10Andrew Bogott: [C: 032] labspuppetbackend: Include some more python dependencies for jessie. [puppet] - 10https://gerrit.wikimedia.org/r/365174 (owner: 10Andrew Bogott) [22:25:17] okie doke [22:25:26] I'm going to roll forward the train before evening swat then [22:30:25] jdlrobson: actually, I'm still looking back through logs, I'm going to revert all the things you and RainbowSprinkles merged since it doesn't look like they were ever sync'd anywhere. I don't want anyone who has no idea what's currently sync'd and unsync'd to sync something and have it blow up. [22:30:44] +1 [22:30:56] k k k [22:31:25] at least from wmf.7 and wmf.9 -- master is fine with me :) [22:35:00] (03PS1) 10Thcipriani: Revert "Move testwiki to MinervaNeue" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365188 [22:35:05] (03CR) 10Thcipriani: [C: 032] Revert "Move testwiki to MinervaNeue" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365188 (owner: 10Thcipriani) [22:37:01] 10Operations, 10Cloud-Services: rack/setup/install labweb100[12].wikimedia.org - https://phabricator.wikimedia.org/T167820#3437954 (10RobH) [22:38:40] thcipriani: They went to mwdebug1001 [22:38:45] I didn't sync everywhere else [22:38:47] Pending reverts [22:39:36] RainbowSprinkles: yeah, I was looking in the backlog to see if you had or not. I'm reverting everything back to the state that is currently live. 
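The fix Chad asked for earlier ("roll the submodules for wmf.9 back to a known-good place") can be sketched as follows. The path and the SHA placeholder are illustrative assumptions, not values from the log; in practice the team used a gerrit revert plus a submodule bump instead:

```shell
# Illustrative sketch only: pin the MobileFrontend submodule in the wmf.9
# deploy branch back to a known-good commit. <known-good-sha> is a placeholder,
# and the staging path is an assumption about the deploy-host layout.
cd /srv/mediawiki-staging/php-1.30.0-wmf.9
cd extensions/MobileFrontend
git fetch origin
git checkout <known-good-sha>
cd ../..
git add extensions/MobileFrontend    # records the new submodule pin
git commit -m "Roll MobileFrontend back to known-good commit"
```

The parent repository only stores the commit the submodule points at, which is why reverting the extension change and updating the pin achieves the same end state.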
[22:39:46] * RainbowSprinkles nods [22:39:48] Best choice [22:39:54] (where we left things yesterday was sane/safe) [22:40:11] heh, well I don't know if I'd go that far :) [22:40:22] but known at least :P [22:43:34] (03Merged) 10jenkins-bot: Revert "Move testwiki to MinervaNeue" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365188 (owner: 10Thcipriani) [22:43:47] (03CR) 10jenkins-bot: Revert "Move testwiki to MinervaNeue" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365188 (owner: 10Thcipriani) [22:44:47] PROBLEM - configured eth on labweb1001 is CRITICAL: Return code of 255 is out of bounds [22:45:37] PROBLEM - dhclient process on labweb1001 is CRITICAL: Return code of 255 is out of bounds [22:46:27] PROBLEM - puppet last run on labweb1001 is CRITICAL: Return code of 255 is out of bounds [22:47:27] PROBLEM - salt-minion processes on labweb1001 is CRITICAL: Return code of 255 is out of bounds [22:47:38] RECOVERY - dhclient process on labweb1001 is OK: PROCS OK: 0 processes with command name dhclient [22:47:47] RECOVERY - configured eth on labweb1001 is OK: OK - interfaces up [22:48:27] RECOVERY - salt-minion processes on labweb1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:48:27] RECOVERY - puppet last run on labweb1001 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [22:52:03] RainbowSprinkles: and good news is I worked out wtf happened there [22:52:50] RainbowSprinkles: tl;dr: As soon as wmf7 is not important - enable MinervaNeue skin [22:52:53] it works fine with wmf9 [22:53:06] Poor wmf.7 [22:53:09] Gets left out!
[22:53:10] then next train it should just work (TM) [22:53:20] wmf7 didnt have https://gerrit.wikimedia.org/r/#/c/357671/ [22:53:26] didnt realise how out of date it was [22:54:18] (03PS1) 10Thcipriani: all wikis to 1.30.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365189 [22:54:20] (03CR) 10Thcipriani: [C: 032] all wikis to 1.30.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365189 (owner: 10Thcipriani) [22:54:39] (hopefully wmf.7 will be not important Soon™) [22:54:48] RainbowSprinkles: but im guessing we'll want to enable on monday [22:54:54] to give us a clear stretch for the week :) [22:56:13] (03Merged) 10jenkins-bot: all wikis to 1.30.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365189 (owner: 10Thcipriani) [22:56:22] (03CR) 10jenkins-bot: all wikis to 1.30.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365189 (owner: 10Thcipriani) [22:57:33] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.30.0-wmf.9 [22:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170713T2300). 
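The "rebuilt wikiversions.php and synchronized wikiversions files" SAL entries above are what scap logs for a wikiversions sync. A sketch of the train-roll step, assuming the standard scap workflow on the deployment host (the path and log message mirror the entries in this log):

```shell
# Sketch of the train roll, assuming the standard scap workflow on the
# deploy host (tin). Edit wikiversions.json in the staging checkout so all
# wikis point at php-1.30.0-wmf.9, then rebuild and sync:
cd /srv/mediawiki-staging
scap sync-wikiversions "all wikis to 1.30.0-wmf.9"
```

Because only the version map changes, this is also the mechanism for the quick rollbacks seen later in the log: flip wikiversions.json back and run the same sync.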
[23:00:37] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:01:37] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:01:38] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:01:38] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:01:48] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:02:07] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:02:37] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:02:37] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: 
/{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:02:47] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy [23:02:57] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] [23:03:08] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy [23:04:38] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:04:47] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy [23:04:47] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [23:05:11] (03CR) 10Platonides: [C: 031] High density logos for es.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/365066 (https://phabricator.wikimedia.org/T170604) (owner: 10MarcoAurelio) [23:05:47] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy [23:06:07] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:06:17] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:07:27] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy [23:07:47] 
PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:07:47] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:08:17] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy [23:08:47] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy [23:08:47] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:08:47] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [23:08:48] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy [23:08:50] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: revert all wikis to php-1.30.0-wmf.9 [23:08:57] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy [23:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:47] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [23:10:47] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy [23:12:52] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy [23:13:32] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] 
[23:14:59] ^ ^ seems to be Timeout reached waiting for an available pooled curl connection! from CirrusSearch/includes/Elastica/PooledHttp.php:67 [23:16:00] for both wmf.9 and wmf.7 -- rolled back because it started exploding, but that doesn't seem to have abated the explosion [23:17:22] wmf.7 and 9? Makes me think it's unrelated [23:17:32] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] [23:17:34] yeah, I think it is [23:17:39] and it seems to be subsiding [23:17:54] I'm going to wait for a bit and try rolling forward again [23:19:59] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to php-1.30.0-wmf.9 [23:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:12] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [23:27:03] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:27:03] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:27:03] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:27:03] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed 
returned the unexpected status 404 (expecting: 200) [23:27:12] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:28:12] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy [23:29:12] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [23:29:12] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy [23:29:12] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [23:30:12] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy [23:30:32] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:31:12] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [23:31:13] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:32:12] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy [23:32:12] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 
404 (expecting: 200) [23:32:22] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy [23:32:22] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [23:35:22] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:35:22] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [23:36:22] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [23:36:22] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:36:42] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy [23:37:13] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:38:22] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:38:22] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy [23:39:12] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) 
[23:39:22] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:39:22] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:39:22] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy [23:39:42] that sure is annoying [23:40:22] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy [23:40:25] this seems all related to the cirrussearch 95th percentile thing [23:40:57] > CirrusSearch eqiad 95th percentile latency on graphite1001 [23:41:22] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy [23:42:12] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy [23:44:22] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy [23:44:23] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:45:12] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:46:22] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy [23:46:32] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL:
/{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:46:33] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [23:47:32] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:47:32] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy [23:47:52] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: revert all wikis to php-1.30.0-wmf.9, again [23:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:22] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy [23:48:52] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [23:49:52] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy