[00:02:47] (03CR) 10Dzahn: [C: 032] "ok because it's not in prod and already cherrypicked anyways" [puppet] - 10https://gerrit.wikimedia.org/r/377746 (https://phabricator.wikimedia.org/T175764) (owner: 10Hashar)
[00:02:59] (03PS2) 10Dzahn: contint: install jsduck via gems [puppet] - 10https://gerrit.wikimedia.org/r/377746 (https://phabricator.wikimedia.org/T175764) (owner: 10Hashar)
[00:08:43] (03CR) 10Dzahn: "This would mean "its-phabricator" would not be loaded anymore? It's in the old dir but not the new dir. Is that expected?" [puppet] - 10https://gerrit.wikimedia.org/r/374667 (https://phabricator.wikimedia.org/T157414) (owner: 10Chad)
[00:10:45] PROBLEM - salt-minion processes on labtestvirt2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:11:15] PROBLEM - DPKG on labtestvirt2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[00:13:35] (03CR) 10Dzahn: [C: 04-1] "should we consider actually moving librenms behind varnish though? Or do we keep following our old rule of "no monitoring systems behind " [puppet] - 10https://gerrit.wikimedia.org/r/366519 (owner: 10Muehlenhoff)
[00:18:35] (03CR) 10Dzahn: [C: 04-1] "i recompiled this and now it reports a syntax error" [puppet] - 10https://gerrit.wikimedia.org/r/356499 (owner: 10Chad)
[00:22:41] (03PS4) 10Dzahn: Gerrit: Add non-masters to have public DNS entries [puppet] - 10https://gerrit.wikimedia.org/r/356499 (owner: 10Chad)
[00:26:45] PROBLEM - Host db1100 is DOWN: PING CRITICAL - Packet loss = 100%
[00:27:55] RECOVERY - Host db1100 is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms
[00:29:08] (03PS5) 10Dzahn: Gerrit: Add non-masters to have public DNS entries [puppet] - 10https://gerrit.wikimedia.org/r/356499 (owner: 10Chad)
[00:30:09] PROBLEM - MariaDB Slave IO: s5 on db1100 is CRITICAL: CRITICAL slave_io_state could not connect
[00:30:18] PROBLEM - MariaDB Slave SQL: s5 on db1100 is CRITICAL: CRITICAL slave_sql_state could not connect
[00:30:58] PROBLEM - mysqld processes on db1100 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[00:34:45] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1954 bytes in 0.121 second response time
[00:35:59] (03PS1) 10Chad: Bump core and all plugins to 2.13.9 [software/gerrit] - 10https://gerrit.wikimedia.org/r/378192
[00:36:28] (03CR) 10Chad: "All artifacts uploaded to archiva.wikimedia.org" [software/gerrit] - 10https://gerrit.wikimedia.org/r/378192 (owner: 10Chad)
[00:37:17] (03PS1) 10Jcrespo: mariadb: Depool db1100 (crashed) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378193
[00:37:29] PROBLEM - MariaDB Slave Lag: s5 on db1100 is CRITICAL: CRITICAL slave_sql_lag could not connect
[00:38:45] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0]
[00:39:51] (03CR) 10Dzahn: [C: 032] Gerrit: Add non-masters to have public DNS entries [puppet] - 10https://gerrit.wikimedia.org/r/356499 (owner: 10Chad)
[00:41:27] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1100 (crashed) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378193 (owner: 10Jcrespo)
[00:43:04] (03Merged) 10jenkins-bot: mariadb: Depool db1100 (crashed) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378193 (owner: 10Jcrespo)
[00:43:17] (03CR) 10jenkins-bot: mariadb: Depool db1100 (crashed) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378193 (owner: 10Jcrespo)
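An editor's aside on the db1100 depool above (00:37-00:43): the change is an edit to wmf-config/db-eqiad.php, merged in Gerrit and then synced out from the deployment host; the sync itself shows up in the log at 00:46 below. A minimal sketch of the deploy-side steps, assuming scap's sync-file subcommand and the tin staging path as used in this log:

    # On the deployment host (tin), after the mediawiki-config change is merged:
    cd /srv/mediawiki-staging
    git pull                                  # fetch the merged change
    # sanity-check that db1100 no longer appears as a pooled replica
    grep -n 'db1100' wmf-config/db-eqiad.php
    # push the single file to the cluster; the quoted message lands in the SAL
    scap sync-file wmf-config/db-eqiad.php 'Depool db1100'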
[00:44:05] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[acme-setup-acme-gerrit]
[00:45:16] ACKNOWLEDGEMENT - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[acme-setup-acme-gerrit] daniel_zahn https://gerrit.wikimedia.org/r/#/c/356499/
[00:45:16] (03PS1) 10Dzahn: gerrit: skip Letsencrypt cert on gerrit2001 [puppet] - 10https://gerrit.wikimedia.org/r/378194
[00:45:52] (03CR) 10Dzahn: "causes <+icinga-wm> PROBLEM - puppet last run on gerrit2001 is CRITICAL:" [puppet] - 10https://gerrit.wikimedia.org/r/356499 (owner: 10Chad)
[00:46:34] (03PS2) 10Dzahn: gerrit: skip Letsencrypt cert on gerrit2001 [puppet] - 10https://gerrit.wikimedia.org/r/378194
[00:46:51] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1100 (duration: 00m 46s)
[00:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:17] (03CR) 10Dzahn: [C: 032] gerrit: skip Letsencrypt cert on gerrit2001 [puppet] - 10https://gerrit.wikimedia.org/r/378194 (owner: 10Dzahn)
[00:49:11] (03CR) 10Dzahn: "are we still waiting with these?" [puppet] - 10https://gerrit.wikimedia.org/r/366818 (owner: 10Muehlenhoff)
[00:49:15] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[00:53:55] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0]
[01:01:11] (03PS9) 10GeoffreyT2000: Rename Wikisaurus namespace on Wiktionary to "Thesaurus" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374063 (https://phabricator.wikimedia.org/T174264)
[01:04:55] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1925 bytes in 0.122 second response time
[01:10:43] (03CR) 10Chad: "This makes https://gerrit2001.wikimedia.org/ inaccessible due to HSTS on the cert. Not a huge deal, but guess it depends on how useful we " [puppet] - 10https://gerrit.wikimedia.org/r/378194 (owner: 10Dzahn)
[01:53:51] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), 10User-Ladsgroup: Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3609620 (10awight) Thanks for correcting my misunderstanding of ulimit! I don't expect that we'r...
[01:54:25] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), 10User-Ladsgroup: Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3609622 (10awight) Fruitless adventure into the bowels of os.pipe --- Diving deeper into the cod...
[02:22:42] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), 10User-Ladsgroup: Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3609624 (10awight) Here's where it gets crazy, though. I was able to clone the deployment direct...
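An aside on the ORES file-handle thread above (T174402): a minimal sketch of how one might count open descriptors per worker while debugging this kind of leak, assuming Linux /proc; the "celery" process pattern is illustrative, not taken from the task:

    #!/bin/bash
    # Count open file descriptors for every process whose command line
    # matches a pattern (here "celery", purely as an example).
    pattern=${1:-celery}
    for pid in $(pgrep -f "$pattern"); do
        n=$(ls /proc/"$pid"/fd 2>/dev/null | wc -l)
        limit=$(awk '/Max open files/ {print $4}' /proc/"$pid"/limits)
        echo "pid=$pid fds=$n soft_limit=$limit"
    done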
[03:00:49] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[03:01:09] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[03:02:19] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[03:08:10] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[03:08:29] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[03:08:32] milimetric, I can connect to analytics-store.eqiad.wmnet from stat1006, but not x1-analytics-slave.eqiad.wmnet. I'm just doing mysql -h from stat1006. If I explicitly pass the --defaults-file the results are the same.
[03:09:59] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[04:03:08] matt_flaschen: sorry, not sure what's up, maybe the puppet config didn't have that set up properly on 1003 and the migration messed it up. I'm home with the baby tomorrow, but ask otto or luca, they'll know and also be able to make the change if needed
[04:03:31] matt_flaschen: but if it's urgent let me know
[04:04:30] milimetric, no, it's not. I can always access it from production if needed. It's for an issue Elena is investigating.
[04:24:04] milimetric, what's Luca's Phabricator?
[05:04:26] <_joe_> matt_flaschen: elukey
[05:04:38] <_joe_> matt_flaschen: but he'll see the backlog here
[05:06:59] Thanks
[05:33:00] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 230, down: 2, dormant: 0, excluded: 0, unused: 0
[06:10:49] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0
[06:19:09] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0
[06:20:09] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0
[06:32:23] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0
[06:39:19] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6461/IPv6: Active
[06:40:19] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 29, down: 0, shutdown: 0
[06:42:39] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0
[06:56:23] (03PS1) 10Giuseppe Lavagetto: admin: add a new ed25519 key for myself [puppet] - 10https://gerrit.wikimedia.org/r/378204
[06:58:59] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0
[07:05:40] hello!
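An aside on the connectivity test matt_flaschen describes above (03:08): a minimal sketch of that kind of check from a stat host, assuming a per-user defaults file; the path is illustrative, not taken from the log:

    #!/bin/bash
    # Quick reachability test for the two analytics DB aliases from a stat host.
    # The --defaults-file path is an assumption; adjust to the local setup.
    DEFAULTS=${DEFAULTS:-"$HOME/.my.cnf"}
    for host in analytics-store.eqiad.wmnet x1-analytics-slave.eqiad.wmnet; do
        if mysql --defaults-file="$DEFAULTS" -h "$host" -e 'SELECT 1' >/dev/null 2>&1; then
            echo "OK   $host"
        else
            echo "FAIL $host"
        fi
    done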
matt_flaschen o/ (but not sure if you are still there)
[07:06:48] (03CR) 10Muehlenhoff: [C: 031] add tgr to pdfrender-admin sudo group [puppet] - 10https://gerrit.wikimedia.org/r/378060 (https://phabricator.wikimedia.org/T175882) (owner: 10RobH)
[07:07:30] (03CR) 10ArielGlenn: [C: 032] admin: add a new ed25519 key for myself [puppet] - 10https://gerrit.wikimedia.org/r/378204 (owner: 10Giuseppe Lavagetto)
[07:28:24] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: mw2256.codfw.wmnet
[07:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:47] 10Operations, 10ops-codfw, 10Patch-For-Review, 10User-Elukey: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#3609772 (10MoritzMuehlenhoff)
[07:28:49] 10Operations, 10ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3609770 (10MoritzMuehlenhoff) 05Open>03Resolved Agreed. I ran a "scap pull" and repooled the server. Closing the task, we can reopen if it crashes again.
[07:29:12] 10Operations, 10ops-eqiad, 10DBA: db1100 crashed - https://phabricator.wikimedia.org/T175973#3609605 (10jcrespo)
[07:34:51] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3609776 (10MoritzMuehlenhoff) >> - salt masters (neodymium) With Cumin this also extends to sarin > Also, it's worth noting that the outage that this task was inspired from w...
[07:35:23] !log depooling elastic1020 - T175951
[07:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:35] T175951: Search backend error during full_text search for 'QUERY_SRTING' after 39: i_o_exception: Can't read unknown type [50] - https://phabricator.wikimedia.org/T175951
[07:38:59] !log shutting down and masking elasticsearch on elastic1020 - T175951
[07:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:15] (03PS1) 10Elukey: Point x1-analytics-slave to dbstore1002 [dns] - 10https://gerrit.wikimedia.org/r/378211
[07:44:17] (03CR) 10Jcrespo: [C: 031] Point x1-analytics-slave to dbstore1002 [dns] - 10https://gerrit.wikimedia.org/r/378211 (owner: 10Elukey)
[07:44:31] (03CR) 10Elukey: [C: 032] Point x1-analytics-slave to dbstore1002 [dns] - 10https://gerrit.wikimedia.org/r/378211 (owner: 10Elukey)
[07:49:36] 10Operations, 10DBA: Lost access to x1-analytics-slave - https://phabricator.wikimedia.org/T175970#3609786 (10jcrespo)
[07:49:44] !log running maintenance/updateCollation.php --force on Persian (fa) wikis (T173601)
[07:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:58] T173601: CollationFa needs some clean up - https://phabricator.wikimedia.org/T173601
[07:50:49] PROBLEM - puppet last run on elastic1020 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[elasticsearch]
[07:51:06] ^ oops, that one is me, silencing as well...
[07:53:31] 10Operations, 10DBA: Lost access to x1-analytics-slave - https://phabricator.wikimedia.org/T175970#3609796 (10elukey) After a chat with @jcrespo we decided to point the x1-analytics-slave domain to dbstore1002 (analytics-slave) - https://gerrit.wikimedia.org/r/#/c/378211 db1029 is a production host and it wou...
[07:53:55] (03CR) 10Paladox: [C: 031] "Did you built this with https://gerrit-review.googlesource.com/#/c/gerrit/+/92830/ please?" [software/gerrit] - 10https://gerrit.wikimedia.org/r/378192 (owner: 10Chad)
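A note on the x1-analytics-slave repoint above (gerrit 378211): once the DNS change is deployed, the alias should resolve to dbstore1002. A minimal verification sketch using standard dig; nothing here is taken from the patch itself:

    #!/bin/bash
    # Confirm the alias now resolves to the intended replica.
    alias=x1-analytics-slave.eqiad.wmnet
    expected=dbstore1002.eqiad.wmnet
    resolved=$(dig +short "$alias" CNAME)
    echo "$alias -> ${resolved:-<no CNAME>}"
    if [[ "$resolved" == "$expected." ]]; then
        echo "repoint live"
    else
        echo "old record may still be cached (check the TTL)"
    fi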
[08:03:21] 10Operations, 10Analytics-Kanban, 10hardware-requests: Decommission stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T173097#3609800 (10elukey)
[08:03:24] !log dropping mariadb analytics users on db1031, db1029
[08:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:58] 10Operations, 10ops-eqiad, 10DBA: db1100 crashed - https://phabricator.wikimedia.org/T175973#3609801 (10jcrespo) @Cmjohnson @robh I assume there is not much left to do here at dc/provider level except keeping a record of the crash and complain if it repeats? This is one of the latest models bought.
[08:10:53] (03PS1) 10Elukey: Remove any trace of stat1003 for decom [puppet] - 10https://gerrit.wikimedia.org/r/378220 (https://phabricator.wikimedia.org/T173097)
[08:11:38] (03CR) 10Elukey: [C: 032] Remove any trace of stat1003 for decom [puppet] - 10https://gerrit.wikimedia.org/r/378220 (https://phabricator.wikimedia.org/T173097) (owner: 10Elukey)
[08:13:06] 10Operations, 10ops-eqiad, 10Analytics-Kanban: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#3609809 (10elukey)
[08:13:09] 10Operations, 10ops-eqiad, 10DBA: db1100 crashed - https://phabricator.wikimedia.org/T175973#3609810 (10jcrespo) However, it is a lot of coincidence that it crashes just hours after being pooled and having some load: https://gerrit.wikimedia.org/r/378003 (it has been idle for weeks before). I would like to...
[08:13:56] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[08:14:16] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[08:14:22] blergh
[08:14:26] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[08:14:53] cp1053 this time
[08:14:55] already gone
[08:14:56] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[08:16:09] 10Operations, 10ops-eqiad, 10DBA: db1100 crashed - https://phabricator.wikimedia.org/T175973#3609813 (10jcrespo) p:05Triage>03Low Low after being depooled.
[08:17:46] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0
[08:17:55] 10Operations, 10Traffic: Text eqiad varnish 503 spikes - https://phabricator.wikimedia.org/T175803#3609815 (10elukey) {F9550407} Happened again this morning, cp1053 seems the one to blame this time (ints for it in the X-cache header)
[08:19:26] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:21:06] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[08:21:17] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:22:06] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:22:06] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:25:56] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0
[08:27:20] 10Operations, 10ops-eqiad, 10Analytics: Remove stat1002 - https://phabricator.wikimedia.org/T173094#3609827 (10elukey)
[08:27:22] 10Operations, 10Analytics-Kanban, 10hardware-requests, 10Patch-For-Review: Decommission stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T173097#3609829 (10elukey)
[08:27:38] 10Operations, 10ops-eqiad, 10Analytics: Remove stat1002 - https://phabricator.wikimedia.org/T173094#3518439 (10elukey)
[08:27:42] 10Operations, 10Analytics-Kanban, 10hardware-requests, 10Patch-For-Review: Decommission stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T173097#3518505 (10elukey) 05duplicate>03Open
[08:28:37] 10Operations, 10ops-eqiad, 10Analytics: Remove stat1002 - https://phabricator.wikimedia.org/T173094#3518439 (10elukey) 05Open>03Resolved Mentioned in https://phabricator.wikimedia.org/T173097
[08:28:53] 10Operations, 10Analytics-Kanban, 10hardware-requests, 10Patch-For-Review: Decommission stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T173097#3518505 (10elukey)
[08:29:28] 10Puppet, 10puppet-compiler: Puppet checks for invalid class names - https://phabricator.wikimedia.org/T175979#3609837 (10fgiunchedi)
[08:33:12] 10Operations, 10fundraising-tech-ops: Long term storage for frack prometheus data - https://phabricator.wikimedia.org/T175738#3609855 (10fgiunchedi) Sounds awesome! re: indefinite storage, the `global` instance of Prometheus now has 1yr retention, likely to be moved to 2yrs.
[08:36:39] 10Operations, 10Fundraising-Backlog, 10fundraising-tech-ops, 10Spike: Spike: Enumerate remaining unported stats - https://phabricator.wikimedia.org/T175850#3609857 (10fgiunchedi) One way would be to generate grafana dashboards' JSON from python and a list of metrics, namely with sth like `grafanalib` as ou...
[08:47:26] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0
[08:48:01] 10Operations, 10Traffic: Text eqiad varnish 503 spikes - https://phabricator.wikimedia.org/T175803#3609869 (10Samtar) It may be worth noting that the two spikes previous to cp1068's were also caused by cp1053
[08:50:21] 10Operations, 10monitoring: Upgrade grafana to 4.5 - https://phabricator.wikimedia.org/T175980#3609871 (10fgiunchedi)
[08:50:27] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0
[08:51:54] 10Operations, 10Analytics-Kanban, 10hardware-requests, 10Patch-For-Review: Decommission stat1002.eqiad.wmnet - https://phabricator.wikimedia.org/T173097#3609884 (10elukey) I removed puppet/salt credentials and wiped puppet, but didn't proceed any further since I didn't want to mess with DC-Ops procedure. I...
[08:52:09] 10Operations, 10ops-eqiad, 10Analytics-Kanban: Decommission stat1003.eqiad.wmnet - https://phabricator.wikimedia.org/T175150#3609885 (10elukey) I removed puppet/salt credentials, powered down and wiped puppet, but didn't proceed any further since I didn't want to mess with DC-Ops procedure. I was under the i...
[08:53:29] (03PS2) 10Filippo Giunchedi: Add ResourceLoader Grafana performance alerts [puppet] - 10https://gerrit.wikimedia.org/r/378023 (https://phabricator.wikimedia.org/T153171) (owner: 10Gilles)
[08:54:26] (03CR) 10Filippo Giunchedi: [C: 032] Add ResourceLoader Grafana performance alerts [puppet] - 10https://gerrit.wikimedia.org/r/378023 (https://phabricator.wikimedia.org/T153171) (owner: 10Gilles)
[08:54:49] !log installing libbluetooth updates
[08:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:45] moritzm: can i pm, please?
[09:06:23] sure
[09:10:25] !log installing tcpdump security updates on trusty hosts (Debian systems already fixed)
[09:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:50] (03CR) 10Filippo Giunchedi: "I'm ok with moving to a Debian package for this, since jmx_exporter is third party and doesn't see a lot of releases. Moving to Debian pac" (032 comments) [debs/prometheus-jmx-exporter] (debian) - 10https://gerrit.wikimedia.org/r/378037 (owner: 10Ottomata)
[09:13:00] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0
[09:17:55] (03CR) 10Muehlenhoff: "Looks good to me. Can you please add a debian/copyright file pointing to the MIT licensed used by upstream?" (031 comment) [debs/prometheus-jmx-exporter] (debian) - 10https://gerrit.wikimedia.org/r/378037 (owner: 10Ottomata)
[09:19:39] (03CR) 10Gehel: Initial debian commit (031 comment) [debs/prometheus-jmx-exporter] (debian) - 10https://gerrit.wikimedia.org/r/378037 (owner: 10Ottomata)
[09:22:09] (03CR) 10Gehel: Initial debian commit (031 comment) [debs/prometheus-jmx-exporter] (debian) - 10https://gerrit.wikimedia.org/r/378037 (owner: 10Ottomata)
[09:24:25] (03CR) 10Muehlenhoff: "Actually, it's Apache2 licensed, sorry for the confusion." [debs/prometheus-jmx-exporter] (debian) - 10https://gerrit.wikimedia.org/r/378037 (owner: 10Ottomata)
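An aside on the debian/copyright request above (09:17, corrected to Apache-2.0 at 09:24): a minimal sketch of a machine-readable (DEP-5) copyright file for the package, written here via a heredoc. The upstream name comes from the repo path in the log; the copyright holder line is an assumption:

    # Write a minimal DEP-5 debian/copyright pointing at the Apache-2.0 license.
    cat > debian/copyright <<'EOF'
    Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
    Upstream-Name: prometheus-jmx-exporter
    Source: https://github.com/prometheus/jmx_exporter

    Files: *
    Copyright: The Prometheus Authors (assumed; check the upstream NOTICE file)
    License: Apache-2.0
     On Debian systems, the complete text of the Apache License 2.0
     can be found in /usr/share/common-licenses/Apache-2.0.
    EOF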
[09:26:10] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0
[09:31:34] (03PS1) 10Gehel: wdqs: reduce heap size to 12GB [puppet] - 10https://gerrit.wikimedia.org/r/378227 (https://phabricator.wikimedia.org/T175919)
[09:37:30] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0
[09:42:40] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0
[09:47:40] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0
[09:59:00] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0
[10:04:52] 10Operations, 10Goal, 10Kubernetes, 10Services (watching), 10User-Joe: Build a slim container for fluentd - https://phabricator.wikimedia.org/T175527#3610031 (10Joe) After some reasoning, I decided to go the following way: - create a fluentbit container for the standardized pod. This will mostly just bl...
[10:05:28] 10Operations, 10Beta-Cluster-Infrastructure, 10TCB-Team, 10Release-Engineering-Team (Watching / External), and 2 others: Deploy new Wikidiff2 version on beta-cluster - https://phabricator.wikimedia.org/T175818#3610032 (10MoritzMuehlenhoff) I cherrypicked the two commits into a 1.4.1+wmf2 package, uploaded...
[10:09:20] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0
[10:29:04] 10Operations, 10Beta-Cluster-Infrastructure, 10TCB-Team, 10Release-Engineering-Team (Watching / External), and 2 others: Deploy new Wikidiff2 version on beta-cluster - https://phabricator.wikimedia.org/T175818#3610041 (10Tobi_WMDE_SW) @MoritzMuehlenhoff great, thanks! I think what's still missing is settin...
[10:30:11] 10Operations, 10Beta-Cluster-Infrastructure, 10TCB-Team, 10Release-Engineering-Team (Watching / External), and 2 others: Deploy new Wikidiff2 version on beta-cluster - https://phabricator.wikimedia.org/T175818#3610042 (10Addshore) >>! In T175818#3610041, @Tobi_WMDE_SW wrote: > @MoritzMuehlenhoff great, tha...
[10:30:32] (03CR) 10Muehlenhoff: Readd rollback handling to debdeploy (031 comment) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/375980 (owner: 10Muehlenhoff)
[10:31:10] 10Operations, 10Beta-Cluster-Infrastructure, 10TCB-Team, 10Release-Engineering-Team (Watching / External), and 2 others: Deploy new Wikidiff2 version on beta-cluster - https://phabricator.wikimedia.org/T175818#3610043 (10Tobi_WMDE_SW) @addshore yay, great!
[10:40:13] (03PS6) 10Muehlenhoff: Readd rollback handling to debdeploy [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/375980
[10:54:29] (03CR) 10Alexandros Kosiaris: [C: 031] puppetmaster: pass volatile_dir to geoip class [puppet] - 10https://gerrit.wikimedia.org/r/377986 (https://phabricator.wikimedia.org/T175864) (owner: 10Hashar)
[10:58:50] (03CR) 10Alexandros Kosiaris: [C: 031] "I 'll merge this on monday, shepherd it across the fleet and ping interested people" [puppet] - 10https://gerrit.wikimedia.org/r/376500 (https://phabricator.wikimedia.org/T175242) (owner: 10Ladsgroup)
[11:03:42] 10Operations, 10Salt: Salt minions randomly crashing when the deployment server grain gets changed - https://phabricator.wikimedia.org/T124646#3610065 (10MoritzMuehlenhoff)
[11:03:46] 10Operations, 10Salt, 10Trebuchet, 10Patch-For-Review: [Trebuchet] Salt times out on parsoid restarts - https://phabricator.wikimedia.org/T63882#3610066 (10MoritzMuehlenhoff)
[11:04:16] 10Operations, 10Salt, 10Patch-For-Review: setup/deploy sarin(WMF5851) as a salt master in codfw - https://phabricator.wikimedia.org/T125752#3610070 (10MoritzMuehlenhoff)
[11:04:23] 10Operations, 10Salt, 10Patch-For-Review: setup/deploy sarin(WMF5851) as a salt master in codfw - https://phabricator.wikimedia.org/T125752#1996363 (10MoritzMuehlenhoff)
[11:04:33] 10Operations, 10Salt: Many minions fail to connect to salt master since 10:39 - https://phabricator.wikimedia.org/T129841#3610075 (10MoritzMuehlenhoff) 05Open>03declined Salt is being removed.
[11:05:06] 10Operations, 10Thumbor, 10Patch-For-Review, 10Performance-Team (Radar), 10User-fgiunchedi: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817#3610080 (10MoritzMuehlenhoff)
[11:05:28] 10Operations, 10Salt, 10Patch-For-Review: slow salt-call invocation on minions - https://phabricator.wikimedia.org/T118380#3610085 (10MoritzMuehlenhoff) 05stalled>03declined Salt is being removed.
[11:06:08] 10Operations, 10Salt: on bootup, salt-minion should not start with -d - https://phabricator.wikimedia.org/T104867#3610091 (10MoritzMuehlenhoff) 05Open>03declined Salt is being removed.
[11:06:17] 10Operations, 10Salt: salt-minion dies if /var is full - https://phabricator.wikimedia.org/T104866#3610094 (10MoritzMuehlenhoff) 05Open>03declined Salt is being removed.
[11:07:16] 10Operations, 10Salt, 10Patch-For-Review: setup/deploy sarin(WMF5851) as a salt master in codfw - https://phabricator.wikimedia.org/T125752#3610100 (10MoritzMuehlenhoff) 05Open>03Resolved This is server is already in use as a Cumin master (which replaces Salt), closing.
[11:07:31] 10Operations, 10Salt: Salt minions randomly crashing when the deployment server grain gets changed - https://phabricator.wikimedia.org/T124646#3610104 (10MoritzMuehlenhoff) 05Open>03declined Salt is being removed.
[11:07:45] 10Operations, 10Salt: on labcontrol1001, /var/cache/salt has too many files! - https://phabricator.wikimedia.org/T129224#3610107 (10MoritzMuehlenhoff) 05Open>03declined Salt is being removed.
[11:14:01] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/377774 (https://phabricator.wikimedia.org/T165519) (owner: 10Elukey)
[11:16:07] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), 10User-Ladsgroup: Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3610122 (10akosiaris) >>! In T174402#3609620, @awight wrote: > Thanks for correcting my misunders...
[11:17:39] thanks moritzm! I am thinking to merge + reimage those hosts on Monday, to avoid any issues over the weekend
[11:20:43] sounds good (or merge on Monday, doesn't really matter), it's also a perfect opportunity to test volans' new reimage script further :-)
[11:21:16] high number of exceptions on mw2256?
[11:22:46] (03PS1) 10Matthias Mullie: Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/378233 (https://phabricator.wikimedia.org/T160185)
[11:22:46] !log jmm@puppetmaster1001 conftool action : set/pooled=no; selector: mw2256.codfw.wmnet
[11:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:06] (03CR) 10Matthias Mullie: [C: 04-1] Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/378233 (https://phabricator.wikimedia.org/T160185) (owner: 10Matthias Mullie)
[11:23:48] jynus: that got repooled after hardware maintenance, I ran scap pull before adding it, but the error log is full of "Undefined variable: wmgElectronSecret"
[11:23:54] depooled it again for now
[11:27:10] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0
[11:31:40] (03PS2) 10Matthias Mullie: Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/378233 (https://phabricator.wikimedia.org/T160185)
[11:32:01] (03CR) 10jerkins-bot: [V: 04-1] Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/378233 (https://phabricator.wikimedia.org/T160185) (owner: 10Matthias Mullie)
[11:35:55] (03PS1) 10MarcoAurelio: Meta(Talk)Namespace configuration for be.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378234 (https://phabricator.wikimedia.org/T175950)
[11:37:20] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0
[11:37:20] (03PS1) 10Addshore: beta: $wgWikiDiff2MovedParagraphDetectionCutoff = 25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378235 (https://phabricator.wikimedia.org/T175818)
[11:38:53] (03CR) 10jerkins-bot: [V: 04-1] beta: $wgWikiDiff2MovedParagraphDetectionCutoff = 25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378235 (https://phabricator.wikimedia.org/T175818) (owner: 10Addshore)
[11:39:38] (03PS2) 10Addshore: beta: $wgWikiDiff2MovedParagraphDetectionCutoff = 25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378235 (https://phabricator.wikimedia.org/T175818)
[11:50:00] (03PS1) 10Elukey: eventlogging_cleaner.py: add feature to pick start_ts from file [puppet] - 10https://gerrit.wikimedia.org/r/378236 (https://phabricator.wikimedia.org/T156933)
[12:48:24] (03PS1) 10Alexandros Kosiaris: Bump celery ores worker # files limit [puppet] - 10https://gerrit.wikimedia.org/r/378239 (https://phabricator.wikimedia.org/T174402)
[12:49:40] PROBLEM - HHVM rendering on mw1226 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[12:50:40] RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 80516 bytes in 0.833 second response time
[12:51:40] PROBLEM - Nginx local proxy to apache on mw1278 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.007 second response time
[12:51:55] (03PS1) 10Thcipriani: Deploy ocg with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/378241 (https://phabricator.wikimedia.org/T129142)
[12:52:21] (03CR) 10jerkins-bot: [V: 04-1] Deploy ocg with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/378241 (https://phabricator.wikimedia.org/T129142) (owner: 10Thcipriani)
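An aside on the mw2256 "Undefined variable: wmgElectronSecret" spam above (11:23): the !log at 13:13 below shows the cause was a stale HHVM bytecode cache after the hardware maintenance. A hedged sketch of that kind of remediation; the depool/repool lines mirror the conftool actions logged in this channel, while the cache path is an assumption, not taken from the log:

    #!/bin/bash
    # Depool, clear the HHVM bytecode cache, refresh code, repool.
    host=mw2256.codfw.wmnet
    sudo confctl select "name=${host}" set/pooled=no   # as in the !log conftool actions
    sudo service hhvm stop
    sudo rm -f /var/cache/hhvm/fcgi.hhbc.sq3           # cache path is an assumption
    scap pull                                          # re-sync MediaWiki code and config
    sudo service hhvm start
    sudo confctl select "name=${host}" set/pooled=yes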
[12:52:40] RECOVERY - Nginx local proxy to apache on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.044 second response time
[13:02:57] (03CR) 10Zoranzoki21: [C: 031] Meta(Talk)Namespace configuration for be.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378234 (https://phabricator.wikimedia.org/T175950) (owner: 10MarcoAurelio)
[13:05:36] (03CR) 10Addshore: [C: 032] beta: $wgWikiDiff2MovedParagraphDetectionCutoff = 25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378235 (https://phabricator.wikimedia.org/T175818) (owner: 10Addshore)
[13:07:27] (03Merged) 10jenkins-bot: beta: $wgWikiDiff2MovedParagraphDetectionCutoff = 25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378235 (https://phabricator.wikimedia.org/T175818) (owner: 10Addshore)
[13:07:31] (03CR) 10jenkins-bot: beta: $wgWikiDiff2MovedParagraphDetectionCutoff = 25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378235 (https://phabricator.wikimedia.org/T175818) (owner: 10Addshore)
[13:08:46] !log addshore@tin Synchronized wmf-config/InitialiseSettings-labs.php: [[gerrit:378235|beta: $wgWikiDiff2MovedParagraphDetectionCutoff = 25]] BETA ONLY 1/2 (duration: 00m 47s)
[13:08:50] <_joe_> thcipriani|afk: that's...
[13:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:05] <_joe_> are we at the point where we need to do that?
[13:09:19] (03PS2) 10Thcipriani: Deploy ocg with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/378241 (https://phabricator.wikimedia.org/T129142)
[13:10:03] !log addshore@tin Synchronized wmf-config/CommonSettings-labs.php: [[gerrit:378235|beta: $wgWikiDiff2MovedParagraphDetectionCutoff = 25]] BETA ONLY 2/2 (duration: 00m 45s)
[13:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:25] _joe_: do which, migrate or decline migration?
[13:10:26] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: mw2256.codfw.wmnet
[13:10:35] <_joe_> thcipriani|afk: migrate :((
[13:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:57] <_joe_> I hoped we could put ocg out of its misery before we needed to migrate it
[13:11:11] 10Operations, 10Beta-Cluster-Infrastructure, 10TCB-Team, 10Patch-For-Review, and 3 others: Deploy new Wikidiff2 version on beta-cluster - https://phabricator.wikimedia.org/T175818#3610346 (10Addshore) Now set on beta!
[13:11:28] _joe_: I declined the task to migrate. I wrote the patches in case the new service shows problems.
[13:11:38] (patches were quick to fire off)
[13:11:50] <_joe_> thcipriani|afk: ok, fair enough :)
[13:11:50] so *hopefully* no, we're not at that point :)
[13:11:56] <_joe_> a last resort thing :P
[13:12:04] yep
[13:12:15] <_joe_> well we can even decide we'll never deploy ocg again even if it's not dead
[13:12:31] PROBLEM - HHVM rendering on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:13:09] true enough, as was mentioned on the task we could also rsync if needed, I just figured I'd cover my bases
[13:13:30] RECOVERY - HHVM rendering on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 80303 bytes in 0.128 second response time
[13:13:31] !log pruned hhvm bytecode cache on mw2256, led to log spam for an undefined variable after being out of service for hardware maintenance
[13:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:05] that mw2256 does not want to let go
[13:21:49] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching), 10User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3610355 (10Samwalton9) Wiley are checking in to ask whether the problem is now resolved. Can anyone confirm?
[13:26:06] elukey: it's fixed now :-)
[13:27:53] 10Operations, 10Beta-Cluster-Infrastructure, 10TCB-Team, 10Patch-For-Review, and 3 others: Deploy new Wikidiff2 version on beta-cluster - https://phabricator.wikimedia.org/T175818#3610358 (10Tobi_WMDE_SW) Confirming that it seems to work: https://en.wikipedia.beta.wmflabs.org/w/index.php?title=User%3ATobia...
[13:28:19] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching), 10User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3610359 (10Kettner2) Hi all, I can confirm that the problem is solved, just checked for a DOI of a Wiley j...
[13:35:36] 10Operations, 10Beta-Cluster-Infrastructure, 10TCB-Team, 10Patch-For-Review, and 3 others: Deploy new Wikidiff2 version on beta-cluster - https://phabricator.wikimedia.org/T175818#3610361 (10Tobi_WMDE_SW) 05Open>03Resolved
[13:41:00] 10Operations, 10Citoid, 10VisualEditor, 10Services (watching), 10User-mobrovac: Wiley requests for DOI and some other publishers don't work in production - https://phabricator.wikimedia.org/T165105#3610375 (10akosiaris) I don't think it has, but I 'd like a confirmation from @Mvolz. The stuff I posted in...
[13:41:27] (03CR) 10Alexandros Kosiaris: [C: 032] Bump celery ores worker # files limit [puppet] - 10https://gerrit.wikimedia.org/r/378239 (https://phabricator.wikimedia.org/T174402) (owner: 10Alexandros Kosiaris)
[13:42:33] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), 10User-Ladsgroup: Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3610378 (10akosiaris) In the interest of moving this forward I 've increased the limit to 8k fds....
[14:16:20] (03PS1) 10Gehel: [WIP] maps: move to vector tiles and cleartables [puppet] - 10https://gerrit.wikimedia.org/r/378245 (https://phabricator.wikimedia.org/T157613)
[14:16:44] (03CR) 10jerkins-bot: [V: 04-1] [WIP] maps: move to vector tiles and cleartables [puppet] - 10https://gerrit.wikimedia.org/r/378245 (https://phabricator.wikimedia.org/T157613) (owner: 10Gehel)
[14:30:15] <|404> where can I find the wikimedia variant of mail-service-notification.sh for icinga?
[14:30:29] <|404> maybe it's sh.erb, but i guess it's in a repo?
[14:30:47] operations/puppet ?
[14:31:00] no idea though, just a guess
[14:31:16] <|404> tabbycat: was my guess too, but phab fails in locating it, github as well
[14:33:10] <|404> and a manual search through operations-puppet/modules/icinga does the same
[14:34:06] https://phabricator.wikimedia.org/diffusion/LICT/ ?
[14:35:01] cannot find it either
[14:35:28] <|404> hm, that's icinga2, not sure if it has the same effect, since wmf uses icinga 1 currently. my problem is, I need a mail-service-notification.sh file which puts all reports containing a special word into a separate file too
[14:35:44] <|404> and since wmf divides reports in files, I guess I'd find that in that file
[14:35:52] <|404> since the regexp help online did not help
[14:36:04] google search?
[14:36:05] <|404> if [[ $SERVICEDESC =~ .*ZNC.* ]]; then currently does not work
[14:36:18] <|404> yeah, I googled it
[14:39:02] <_joe_> win 61
[14:39:14] (03PS2) 10Ottomata: Initial debian commit [debs/prometheus-jmx-exporter] - 10https://gerrit.wikimedia.org/r/378040 (https://phabricator.wikimedia.org/T175922)
[14:39:39] (03CR) 10Ottomata: "Moritz, did your last comment mean that debian/copyright was not necessary, or that I should put an Apache 2 license there?" (032 comments) [debs/prometheus-jmx-exporter] (debian) - 10https://gerrit.wikimedia.org/r/378037 (owner: 10Ottomata)
[14:40:20] (03Abandoned) 10Ottomata: Initial debian commit [debs/prometheus-jmx-exporter] - 10https://gerrit.wikimedia.org/r/378040 (https://phabricator.wikimedia.org/T175922) (owner: 10Ottomata)
[14:40:22] (03PS2) 10Gehel: [WIP] maps: move to vector tiles and cleartables [puppet] - 10https://gerrit.wikimedia.org/r/378245 (https://phabricator.wikimedia.org/T157613)
[14:41:28] (03PS3) 10Ottomata: Initial debian commit [debs/prometheus-jmx-exporter] (debian) - 10https://gerrit.wikimedia.org/r/378037 (https://phabricator.wikimedia.org/T175922)
[14:44:22] (03PS4) 10Ottomata: Initial debian commit [debs/prometheus-jmx-exporter] (debian) - 10https://gerrit.wikimedia.org/r/378037 (https://phabricator.wikimedia.org/T175922)
[14:44:57] (03CR) 10Ottomata: "I think you meant it should be Apache! I symlinked it to the upstream repo LICENSE file. S'ok?" [debs/prometheus-jmx-exporter] (debian) - 10https://gerrit.wikimedia.org/r/378037 (https://phabricator.wikimedia.org/T175922) (owner: 10Ottomata)
[14:47:45] (03PS1) 10BBlack: varnishxcps: generate new hierarchical TLS stats [puppet] - 10https://gerrit.wikimedia.org/r/378246
[15:06:10] (03CR) 10Muehlenhoff: "The way it is, is fine, simply have a debian/copyright file which points to the license in use." [debs/prometheus-jmx-exporter] (debian) - 10https://gerrit.wikimedia.org/r/378037 (https://phabricator.wikimedia.org/T175922) (owner: 10Ottomata)
[15:07:08] (03CR) 10Ottomata: "Gr8, somebody give me that sweet sweet +1" [debs/prometheus-jmx-exporter] (debian) - 10https://gerrit.wikimedia.org/r/378037 (https://phabricator.wikimedia.org/T175922) (owner: 10Ottomata)
[15:07:42] (03CR) 10Muehlenhoff: [C: 031] Initial debian commit [debs/prometheus-jmx-exporter] (debian) - 10https://gerrit.wikimedia.org/r/378037 (https://phabricator.wikimedia.org/T175922) (owner: 10Ottomata)
[15:08:08] (03CR) 10Ottomata: [V: 032 C: 032] "danke!" [debs/prometheus-jmx-exporter] (debian) - 10https://gerrit.wikimedia.org/r/378037 (https://phabricator.wikimedia.org/T175922) (owner: 10Ottomata)
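An aside on |404's question above (14:30-14:36): a mail-service-notification.sh-style script is plain shell, so duplicating matching notifications into a separate file amounts to a bash regex test on the service description. A minimal sketch under that assumption; it presumes SERVICEDESC and friends are already set by the surrounding script (as in the snippet quoted at 14:36), and the log path is illustrative. Two common reasons such a check "does not work": the script runs under /bin/sh rather than bash (no [[ =~ ]] support), or the regex pattern is quoted, which makes bash match it literally:

    #!/bin/bash
    # Fragment for a notification script: besides sending mail, append any
    # notification whose service description matches a keyword to its own file.
    KEYWORD=ZNC
    KEYWORD_LOG=/var/log/icinga/znc-notifications.log   # illustrative path
    if [[ "$SERVICEDESC" =~ $KEYWORD ]]; then
        printf '%s %s/%s is %s\n' "$LONGDATETIME" "$HOSTNAME" \
            "$SERVICEDESC" "$SERVICESTATE" >> "$KEYWORD_LOG"
    fi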
[15:09:52] \o/
[15:18:40] (03CR) 10BBlack: [C: 032] varnishxcps: generate new hierarchical TLS stats [puppet] - 10https://gerrit.wikimedia.org/r/378246 (owner: 10BBlack)
[15:20:20] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 2058316
[15:21:29] <_joe_> !log uploaded td-agent-bit (fluentd-bit) to wikimedia-stretch T175527
[15:21:41] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2004475
[15:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:47] T175527: Build a slim container for fluentd - https://phabricator.wikimedia.org/T175527
[15:35:50] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2049351
[15:36:25] this is not good
[15:37:15] all upload hosts
[15:37:19] I'm here
[15:37:22] looking
[15:37:24] hi bblack !
[15:37:26] thanks :)
[15:38:24] !log cp1072 - backend restart, mailbox lag
[15:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:00] !log cp1099 - backend restart, mailbox lag
[15:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:20] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 0
[15:41:50] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 0
[15:45:50] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 0
[15:46:05] !log cp1049 - backend restart, mailbox lag
[15:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:34] I was a bit suspicious when I saw all three of them popping up at the same time, but it seems okish
[15:48:18] yeah
[15:48:32] request load is a factor in exacerbating the mailbox lag patterns
[15:49:05] so I think this is just a particularly egregious case of 3/11 nodes all spiking up around the same time, mostly in response to their daily load patterns
[15:49:46] * elukey nods
[15:50:12] it's not great to effectively wipe 27% of our upload@eqiad cache storage over a period of ~8 minutes, either.
[15:50:41] but there's not a lot of great alternatives, we can't wait for them all to start inducing 503s and then find the situation more rushed
[15:50:48] (or sit here staring and waiting for that to eventually happen)
[15:54:17] yep yep makes sense
[15:56:02] (03PS1) 10Ottomata: Versionless .jar symlink fix [debs/prometheus-jmx-exporter] (debian) - 10https://gerrit.wikimedia.org/r/378256
[15:56:32] (03CR) 10Ottomata: [V: 032 C: 032] Versionless .jar symlink fix [debs/prometheus-jmx-exporter] (debian) - 10https://gerrit.wikimedia.org/r/378256 (owner: 10Ottomata)
[16:03:53] (03PS1) 10Ottomata: Versionless .jar symlink fix [debs/prometheus-jmx-exporter] (debian) - 10https://gerrit.wikimedia.org/r/378258
[16:06:42] (03CR) 10Ottomata: [V: 032 C: 032] Versionless .jar symlink fix [debs/prometheus-jmx-exporter] (debian) - 10https://gerrit.wikimedia.org/r/378258 (owner: 10Ottomata)
[16:16:19] (03PS1) 10Giuseppe Lavagetto: Improvements to the build script [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/378259
[16:16:22] (03PS1) 10Giuseppe Lavagetto: Add fluent-bit image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/378260 (https://phabricator.wikimedia.org/T175527)
[16:21:56] (03CR) 10Ottomata: role::kafka::jumbo::broker: enable Prometheus JMX monitoring (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/377753 (https://phabricator.wikimedia.org/T175922) (owner: 10Elukey)
[16:41:17] 10Operations, 10ops-eqiad, 10DBA, 10Phabricator: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3610984 (10mmodell) @jcrespo: ok whenever works for you I'll try to be available.
[16:49:31] (03CR) 1020after4: [C: 031] Scap3: Go ahead and `scap deploy --init` a freshly provisioned repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/377304 (owner: 10Chad)
[16:50:14] (03PS2) 1020after4: Scap3: Go ahead and `scap deploy --init` a freshly provisioned repo [puppet] - 10https://gerrit.wikimedia.org/r/377304 (owner: 10Chad)
[16:50:31] (03CR) 1020after4: "So yeah, creates prevents it running twice, so refreshonly isn't needed I think." [puppet] - 10https://gerrit.wikimedia.org/r/377304 (owner: 10Chad)
[16:50:53] (03CR) 10jerkins-bot: [V: 04-1] Scap3: Go ahead and `scap deploy --init` a freshly provisioned repo [puppet] - 10https://gerrit.wikimedia.org/r/377304 (owner: 10Chad)
[16:55:40] elukey: morbid curiosity here, all three upload hosts' `Varnish expiry mailbox lag` went critical at once? o.O
[17:06:33] (03CR) 10Dzahn: [C: 032] puppetmaster: pass volatile_dir to geoip class [puppet] - 10https://gerrit.wikimedia.org/r/377986 (https://phabricator.wikimedia.org/T175864) (owner: 10Hashar)
[17:07:47] .. dependency though .. hrmm
[17:11:02] 10Operations, 10DNS, 10Patch-For-Review: arbcom-fi.wikipedia.org does not have a mobile version at arbcom-fi.m.wikipedia.org - https://phabricator.wikimedia.org/T175447#3593590 (10Dzahn) Operations and DNS part of this is already done - but there are more actions needed or no?
[17:11:42] 10Operations, 10DNS, 10Patch-For-Review: arbcom-fi.wikipedia.org does not have a mobile version at arbcom-fi.m.wikipedia.org - https://phabricator.wikimedia.org/T175447#3611043 (10Dzahn) 05Open>03Resolved a:03Dzahn Please feel free to reopen if there is something more to it.
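An aside on the expiry mailbox lag episode above (15:20-15:50): the lag the Icinga check reports is, roughly, how far varnish's expiry thread has fallen behind on messages mailed to it by worker threads. A minimal sketch of reading it directly from varnishstat counters; the counter names are standard Varnish 4 MAIN counters, the threshold is illustrative, and on a split frontend/backend setup this would target the backend varnishd instance:

    #!/bin/bash
    # Approximate the "expiry mailbox lag": objects mailed to the expiry
    # thread minus objects it has received so far.
    mailed=$(varnishstat -1 -f MAIN.exp_mailed | awk '{print $2}')
    received=$(varnishstat -1 -f MAIN.exp_received | awk '{print $2}')
    lag=$((mailed - received))
    echo "expiry mailbox lag: $lag"
    # illustrative threshold, roughly where the alerts above fired
    if (( lag > 2000000 )); then
        echo "lag is critical; a backend restart (as in the !log lines above) clears it"
    fi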
[17:15:03] 10Operations, 10Ops-Access-Requests: Requesting access to Stat1005 for zhousquared - https://phabricator.wikimedia.org/T175959#3609129 (10Dzahn) @ZhouZ If something happens like losing your laptop, please tell us about it ASAP, so we can revoke the old key first.
[17:15:58] (03PS1) 10Dzahn: admins: remove zhousquared's SSH key [puppet] - 10https://gerrit.wikimedia.org/r/378262
[17:16:41] (03CR) 10Dzahn: [C: 032] admins: remove zhousquared's SSH key [puppet] - 10https://gerrit.wikimedia.org/r/378262 (owner: 10Dzahn)
[17:17:22] 10Operations, 10Ops-Access-Requests: Requesting access to Stat1005 for zhousquared - https://phabricator.wikimedia.org/T175959#3611063 (10Dzahn) removed the old key : https://gerrit.wikimedia.org/r/#/c/378262/
[17:17:31] 10Operations, 10Ops-Access-Requests: Requesting access to Stat1005 for zhousquared - https://phabricator.wikimedia.org/T175959#3611064 (10ZhouZ) Yes, please revoke the old key - sorry I forgot to tell you earlier.
[17:22:30] PROBLEM - Nginx local proxy to apache on mw1288 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.007 second response time
[17:22:50] PROBLEM - Apache HTTP on mw1288 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[17:23:10] PROBLEM - HHVM rendering on mw1288 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[17:23:51] RECOVERY - Apache HTTP on mw1288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.042 second response time
[17:24:20] RECOVERY - HHVM rendering on mw1288 is OK: HTTP OK: HTTP/1.1 200 OK - 80320 bytes in 0.287 second response time
[17:24:40] RECOVERY - Nginx local proxy to apache on mw1288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.097 second response time
[17:27:20] (03PS1) 10Catrope: Enable $wgStructuredChangeFiltersOnWatchlist on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378263 (https://phabricator.wikimedia.org/T164234)
[17:27:22] (03PS1) 10Catrope: Enable structured change filters by default on cawiki, frwiki and hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378264 (https://phabricator.wikimedia.org/T157642)
[17:27:24] (03PS1) 10Catrope: Enable structured change filters by default on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378265 (https://phabricator.wikimedia.org/T157642)
[17:27:53] (03CR) 10Catrope: [C: 04-2] "Not before September 19th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378263 (https://phabricator.wikimedia.org/T164234) (owner: 10Catrope)
[17:28:02] (03CR) 10Catrope: [C: 04-2] "Not before September 19th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378264 (https://phabricator.wikimedia.org/T157642) (owner: 10Catrope)
[17:28:32] (03CR) 10Catrope: [C: 04-2] "Not before September 26th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378265 (https://phabricator.wikimedia.org/T157642) (owner: 10Catrope)
[17:43:04] (03CR) 10Hashar: "Need the parent change to be merged as well https://gerrit.wikimedia.org/r/#/c/377980/ :]" [puppet] - 10https://gerrit.wikimedia.org/r/377986 (https://phabricator.wikimedia.org/T175864) (owner: 10Hashar)
[18:14:12] PROBLEM - HHVM rendering on mw1280 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[18:14:13] PROBLEM - Nginx local proxy to apache on mw1280 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.007 second response time
[18:14:22] PROBLEM - Apache HTTP on mw1280 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[18:15:12] RECOVERY - HHVM rendering on mw1280 is OK: HTTP OK: HTTP/1.1 200 OK - 80347 bytes in 0.443 second response time
[18:15:13] RECOVERY - Nginx local proxy to apache on mw1280 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.092 second response time
[18:15:22] RECOVERY - Apache HTTP on mw1280 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.085 second response time
[18:24:40] 10Operations, 10DBA: Lost access to x1-analytics-slave - https://phabricator.wikimedia.org/T175970#3611209 (10Etonkovidova) @elukey Thanks! I can access I need an access to `echo_` tables and I do not see them replicated there (or the acccess is not allowed?).
[18:32:40] 10Operations, 10DBA: Lost access to x1-analytics-slave - https://phabricator.wikimedia.org/T175970#3611225 (10jcrespo) @Etonkovidova Can you provide more information about how you plan to use that data? It was not initially included on analytics data because it was both more difficult technologically and not n...
[19:21:22] 10Operations, 10DBA: Lost access to x1-analytics-slave - https://phabricator.wikimedia.org/T175970#3611260 (10Etonkovidova) @jcrespo Echo notifications is a project that Global Collaboration team is responsible for (I am a QA for that team). From time to time, following users' complaints or requests, or our ow...
[19:35:40] (03CR) 10Mforns: [C: 031] "LGTM! :]" [puppet] - 10https://gerrit.wikimedia.org/r/378236 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey)
[20:14:37] 10Operations, 10Wikimedia-Logstash, 10vm-requests, 10Discovery-Search (Current work), 10Patch-For-Review: Provision VMs on Ganeti for logstash100[123] - https://phabricator.wikimedia.org/T173565#3611289 (10debt)
[20:14:39] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work), 10Patch-For-Review: setup/install logstash100[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T175045#3611287 (10debt) 05Open>03Resolved Thanks, @RobH and @Gehel !
[20:18:49] (03CR) 10Dzahn: "so this needs to happen before https://gerrit.wikimedia.org/r/#/c/374667/ right?" [software/gerrit] - 10https://gerrit.wikimedia.org/r/378192 (owner: 10Chad)
[20:18:49] 10Operations, 10Operations-Software-Development, 10Goal, 10Patch-For-Review, 10Technical-Debt: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3611291 (10debt)
[20:19:59] (03PS22) 10Dzahn: Zuul: Add systemd script for zuul [puppet] - 10https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) (owner: 10Paladox)
[20:20:32] (03CR) 10Paladox: [C: 031] "> so this needs to happen before https://gerrit.wikimedia.org/r/#/c/374667/" [software/gerrit] - 10https://gerrit.wikimedia.org/r/378192 (owner: 10Chad)
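An aside on the "Zuul: Add systemd script for zuul" change above (gerrit 359016, tested on contint2001 later in this log): a hedged sketch of what a minimal systemd unit for the zuul-server daemon could look like, written via a heredoc. The paths, user, and daemon flags are illustrative assumptions, not the contents of the merged puppet template:

    # Illustrative zuul-server unit; the real one is rendered by puppet.
    sudo tee /lib/systemd/system/zuul.service >/dev/null <<'EOF'
    [Unit]
    Description=Zuul gating daemon
    After=network.target

    [Service]
    Type=simple
    User=zuul
    # binary path, config path, and flags are assumptions for this sketch
    ExecStart=/usr/bin/zuul-server -d -c /etc/zuul/zuul.conf
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target
    EOF
    sudo systemctl daemon-reload
    sudo systemctl start zuul     # as done on contint2001 at 21:50 below
    sudo systemctl status zuul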
Either compilation failed or puppetmaster has issues [20:59:33] PROBLEM - Check Varnish expiry mailbox lag on cp1073 is CRITICAL: CRITICAL: expiry mailbox lag is 2032518 [21:00:26] (03CR) 10Volans: [C: 04-1] "Looks good in general, I think that with the refactoring we're loosing a safety check, see inline together with some other comment, mostly" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/378236 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey) [21:07:50] (03CR) 10Dzahn: Zuul: Add systemd script for zuul (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) (owner: 10Paladox) [21:09:25] (03PS23) 10Paladox: Zuul: Add systemd script for zuul [puppet] - 10https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) [21:09:39] (03CR) 10Paladox: Zuul: Add systemd script for zuul (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) (owner: 10Paladox) [21:12:52] RECOVERY - puppet last run on mwdebug1002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [21:17:20] (03PS24) 10Paladox: Zuul: Add systemd script for zuul [puppet] - 10https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) [21:25:25] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/7900/" [puppet] - 10https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) (owner: 10Paladox) [21:26:44] 10Operations, 10Ops-Access-Requests: Requesting access to Stat1005 for zhousquared - https://phabricator.wikimedia.org/T175959#3611440 (10Dzahn) a:03Dzahn [21:36:09] zuul-merger.service is running on both contint servers, while zuul.service is just running on 1001 and dead on 2001.. is it supposed to be that way? [21:36:22] hashar: ^ [21:36:33] 2001 is just a warm standby or is it [21:41:50] (03CR) 10Chad: "I76edf0e9a6b91171829050eaa07730e4565ffb1f needs to land first" [puppet] - 10https://gerrit.wikimedia.org/r/374667 (https://phabricator.wikimedia.org/T157414) (owner: 10Chad) [21:42:00] (03CR) 10Chad: [V: 032 C: 032] Bump core and all plugins to 2.13.9 [software/gerrit] - 10https://gerrit.wikimedia.org/r/378192 (owner: 10Chad) [21:43:00] !log demon@tin Started deploy [gerrit/gerrit@48ad332]: 2.13.9 -- not being used yet though [21:43:08] !log demon@tin Finished deploy [gerrit/gerrit@48ad332]: 2.13.9 -- not being used yet though (duration: 00m 08s) [21:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:37] :) [21:44:21] no_justifications, did you build with that change described on ^^. Sorry for repeating. [21:44:28] no_justification ^^ [21:44:44] Huh? [21:44:47] What change? 
[21:45:04] https://gerrit-review.googlesource.com/#/c/gerrit/+/92830/
[21:46:15] (03CR) 10Dzahn: [C: 032] "already had a +1 from Hashar earlier, compiler output looks good now, going to test on contint2001 first, where zuul-service is currently " [puppet] - 10https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) (owner: 10Paladox)
[21:46:42] (03CR) 10Paladox: "thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/359016 (https://phabricator.wikimedia.org/T167833) (owner: 10Paladox)
[21:47:37] !log contint1001 - tmp disable puppet - contint2001 - test zuul unit file (gerrit 359016)
[21:47:39] mutante: yeah paladox kind of aced that change :]
[21:47:44] :)
[21:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:47:56] paladox: Yes, I built from that
[21:48:03] hashar: heh there you are, so i wanted to let you know zuul.service was already dead on contint2001
[21:48:04] 2.13.9 tag + cherry picked
[21:48:08] but not zuul-merger
[21:48:43] mutante: so yeah contint2001 has the zuul-merger active so we got two of them
[21:48:56] but zuul-server is masked/inactive or whatever. That is the spare
[21:49:01] in case contint1001 explodes
[21:49:11] no_justification thanks. Sorry for pinging and asking if you did it on there. :) :)
[21:49:16] It's k
[21:49:25] hashar: ok, so then it is as it should be :) but i am also not breaking anything if zuul starts on 2001, right
[21:49:45] I am not sure what will happen :]
[21:49:51] maybe we will end up with two zuul
[21:50:02] there's always a chance it will start but then fail as a test is started.
[21:50:12] though if it stays running, that's a good sign it will work.
[21:50:41] !log contint2001 - systemctl start zuul
[21:50:51] just to confirm the unit file works, right
[21:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:52:08] it's active (running) on both, but i can stop it again
[21:52:17] but yea, it does work
[21:52:33] :)
[21:53:23] when you migrate the one to contint1001, make sure it still emits to statsd
[21:53:34] enables puppet again on contint1001 and lets the change apply there
[21:53:43] after a few minutes there should still be graph at the bottom https://integration.wikimedia.org/zuul/ (after refreshing the page)
[21:53:58] hashar: ok!
[21:54:15] hashar: would you prefer i stop it on 2001 again for now?
[21:54:25] yup
[21:54:35] though puppet might stop it for you
[21:54:50] done. stopped
[21:55:46] runs puppet to confirm it stays stopped. yes it does
[22:01:25] (03CR) 10Smalyshev: [C: 031] wdqs: reduce heap size to 12GB [puppet] - 10https://gerrit.wikimedia.org/r/378227 (https://phabricator.wikimedia.org/T175919) (owner: 10Gehel)
[22:01:57] mutante: anyway it is past midnight and thus I am officially sleeping :]
[22:02:28] the graph still looks fine
[22:02:46] the status is fine... all good afaict
[22:04:37] good night hashar :)
[22:05:25] paladox: mutante: congratulations :]
[22:05:37] :) thanks, and paladox for writing it
[22:05:45] :) :)
[22:05:46] you're welcome :)
[22:07:24] i'm sure this isn't change related, but while seeing zuul logs: zuul.GerritEventConnector: Received unrecognized event type 'ref-replicated' from Gerrit.
[22:07:42] yeah that is harmless [22:07:53] ref-replicated and ref-replication-done are not recognized [22:07:55] ok :) [22:08:08] I made Zuul raise a WARNING log whenever it receives an event from Gerrit that it does not understand [22:08:17] *nods* [22:08:22] that is a way for people to find out that Zuul needs to be taught a new event type [22:08:30] I guess that can be monkey patched [22:08:43] it is probably better to have a clean journalctl [22:08:56] feel free to file it in Phabricator [22:09:07] alrighty [22:09:10] at some point I will rebuild the zuul.deb entirely [22:09:42] RECOVERY - Check Varnish expiry mailbox lag on cp1073 is OK: OK: expiry mailbox lag is 224 [22:09:44] ok yep. don't worry about it now :) weekend [22:10:28] thx! [22:16:22] PROBLEM - puppet last run on poolcounter1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:16:55] 10Operations, 10Phabricator, 10Traffic, 10Zero: Missing IP addresses for Maroc Telecom - https://phabricator.wikimedia.org/T174342#3611466 (10Dispenser) @Dzahn The issue demonstrated is that we cannot reliably filter for Wikipedia Zero connectivity. This can be remedied by: * By Ops continuously communica... [22:19:39] (03PS3) 10Chad: Scap3: Go ahead and `scap deploy --init` a freshly provisioned repo [puppet] - 10https://gerrit.wikimedia.org/r/377304 [22:38:22] PROBLEM - HHVM rendering on mw1208 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time [22:39:22] RECOVERY - HHVM rendering on mw1208 is OK: HTTP OK: HTTP/1.1 200 OK - 80374 bytes in 1.498 second response time [22:41:41] (03PS2) 10Thcipriani: CI: install docker-ce from download.docker.com [puppet] - 10https://gerrit.wikimedia.org/r/377492 (https://phabricator.wikimedia.org/T175293) [22:44:43] RECOVERY - puppet last run on poolcounter1002 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [22:51:49] (03CR) 10Thcipriani: "puppet compiler run: https://puppet-compiler.wmflabs.org/compiler02/7901/contint1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/377492 (https://phabricator.wikimedia.org/T175293) (owner: 10Thcipriani) [22:57:02] 10Operations, 10Release Pipeline, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Provision Docker >= 17.05 on contint1001 - https://phabricator.wikimedia.org/T175293#3611511 (10thcipriani) >>! In T175293#3599497, @hashar wrote: > Potentially the required upstream package could be added to a new c...
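The zuul unit file exercised on contint2001 above (gerrit change 359016) is not quoted in the log. As a minimal sketch only — the paths, user, and flags below are assumptions for illustration, not the contents of the actual change — a zuul-server systemd unit of this general shape would look like:

    [Unit]
    Description=Zuul scheduler (zuul-server)
    After=network.target

    [Service]
    Type=simple
    # User/group, config path, and the stay-in-foreground flag are
    # illustrative guesses, not the values from the merged change.
    User=zuul
    Group=zuul
    ExecStart=/usr/bin/zuul-server -c /etc/zuul/zuul.conf -d
    ExecReload=/bin/kill -HUP $MAINPID
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target

With a unit like this, the smoke test above amounts to `systemctl start zuul` followed by checking for "active (running)" in `systemctl status zuul`, and `systemctl stop zuul` (or the next puppet run) returns the warm standby to its stopped state.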
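On the harmless 'ref-replicated' warnings: a minimal sketch of the monkey patch idea floated above, assuming Gerrit stream events arrive as dicts with a 'type' key. The attachment point is hypothetical and would need to be confirmed against Zuul's actual Gerrit connection code before use:

    import logging

    log = logging.getLogger('zuul.GerritEventConnector')

    # Events emitted by Gerrit's replication plugin; Zuul has no handler
    # for them, so each one currently produces a WARNING in the journal.
    IGNORED_EVENT_TYPES = frozenset(('ref-replicated', 'ref-replication-done'))

    def filter_replication_events(handle_event):
        """Wrap an event handler so replication noise is dropped early."""
        def wrapper(event):
            if event.get('type') in IGNORED_EVENT_TYPES:
                log.debug("Ignoring replication event: %s", event['type'])
                return None
            return handle_event(event)
        return wrapper

    # Hypothetical attachment point only: the real hook (method name, and
    # whether it receives the event dict directly) would have to be checked
    # in Zuul's gerrit connection module before patching anything.

As noted above, filing a task and fixing it properly in the zuul.deb rebuild is the cleaner route; a patch like this only buys a quiet journalctl in the meantime.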
[23:14:30] (03PS1) 10Smalyshev: Generate daily diffs for categories RDF [puppet] - 10https://gerrit.wikimedia.org/r/378355 (https://phabricator.wikimedia.org/T173774) [23:15:05] (03CR) 10jerkins-bot: [V: 04-1] Generate daily diffs for categories RDF [puppet] - 10https://gerrit.wikimedia.org/r/378355 (https://phabricator.wikimedia.org/T173774) (owner: 10Smalyshev) [23:16:26] (03PS2) 10Smalyshev: Generate daily diffs for categories RDF [puppet] - 10https://gerrit.wikimedia.org/r/378355 (https://phabricator.wikimedia.org/T173774) [23:16:54] (03CR) 10jerkins-bot: [V: 04-1] Generate daily diffs for categories RDF [puppet] - 10https://gerrit.wikimedia.org/r/378355 (https://phabricator.wikimedia.org/T173774) (owner: 10Smalyshev) [23:20:07] (03PS3) 10Smalyshev: Generate daily diffs for categories RDF [puppet] - 10https://gerrit.wikimedia.org/r/378355 (https://phabricator.wikimedia.org/T173774) [23:22:43] PROBLEM - MariaDB Slave Lag: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [23:24:12] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [23:24:12] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [23:24:13] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [23:24:13] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [23:24:13] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [23:24:22] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [23:24:22] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [23:24:22] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [23:24:23] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [23:24:23] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [23:24:23] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [23:24:32] PROBLEM - MariaDB Slave SQL: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [23:24:32] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [23:24:42] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [23:24:43] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [23:24:52] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [23:24:52] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state could not connect [23:24:52] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [23:24:53] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [23:24:53] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: CRITICAL slave_io_state could not connect [23:25:17] (03PS1) 10MaxSem: Start migration to Unicode sections everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378357 (https://phabricator.wikimedia.org/T152540) [23:27:48] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 
10Release-Engineering-Team (Watching / External), 10Zuul: Migrate zuul-server behind systemd service - https://phabricator.wikimedia.org/T167845#3611554 (10Dzahn) 05Open>03Resolved 17:17 <+wikibugs> (PS24) Paladox: Zuul: Add s... [23:31:03] PROBLEM - MariaDB Slave Lag: m3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [23:31:22] PROBLEM - MariaDB Slave Lag: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [23:31:42] PROBLEM - MariaDB Slave Lag: x1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [23:31:42] PROBLEM - MariaDB Slave Lag: m2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [23:31:43] PROBLEM - MariaDB Slave Lag: s4 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [23:31:43] PROBLEM - MariaDB Slave Lag: s6 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [23:31:53] PROBLEM - MariaDB Slave Lag: s1 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [23:31:53] PROBLEM - MariaDB Slave Lag: s7 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag could not connect [23:37:15] dbstore1001 crashed
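For context on the dbstore1001 flood above: each of those Icinga alerts is a per-section replication check, and "could not connect" across every section at once is the usual signature of the host itself going down, which the crash note confirms. A minimal sketch of what such a slave-lag check boils down to — the pymysql client, credentials, and thresholds here are illustrative assumptions, not WMF's actual check script:

    #!/usr/bin/env python
    # Nagios-style MariaDB slave lag check (illustrative sketch only).
    import sys

    import pymysql  # assumed client library

    WARN, CRIT = 60, 300  # seconds; illustrative thresholds

    def main(host):
        try:
            conn = pymysql.connect(host=host, user='monitor', password='...',
                                   cursorclass=pymysql.cursors.DictCursor)
        except pymysql.err.OperationalError as exc:
            # The state seen above: the server is down, so every check
            # reports CRITICAL "could not connect".
            print("CRITICAL slave_sql_lag could not connect: %s" % exc)
            return 2
        try:
            with conn.cursor() as cur:
                cur.execute("SHOW SLAVE STATUS")
                row = cur.fetchone()
        finally:
            conn.close()
        lag = row and row.get('Seconds_Behind_Master')
        if lag is None:
            print("CRITICAL slave_sql_lag is NULL (replication threads stopped?)")
            return 2
        if lag >= CRIT:
            print("CRITICAL slave_sql_lag %ds" % lag)
            return 2
        if lag >= WARN:
            print("WARNING slave_sql_lag %ds" % lag)
            return 1
        print("OK slave_sql_lag %ds" % lag)
        return 0

    if __name__ == '__main__':
        sys.exit(main(sys.argv[1]))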