[00:12:30] Operations, ops-eqiad, Cloud-VPS: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4173352 (Andrew) @Cmjohnson, sorry, now we're talking about doing this all in one day. Could you be available for a specific appointment (probably around 1PM) to re-ra...
[00:17:20] !log start reindex for commonswiki, eqiad elasticsearch, commonswiki_general appears to have failed previous reindex
[00:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:44:53] !log l10nupdate@tin scap sync-l10n completed (1.32.0-wmf.1) (duration: 07m 11s)
[02:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:26:46] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 825.92 seconds
[04:12:35] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0
[04:12:35] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[04:15:55] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 159.84 seconds
[04:31:23] Operations, Puppet, puppet-compiler, Continuous-Integration-Config: Figure out a way to enable volunteers to use the puppet compiler - https://phabricator.wikimedia.org/T192532#4173521 (Joe) The compiler has little to do with @EddieGP's request, which seems sensible, and has to do with the jenki... 
[05:08:15] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[05:08:25] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0
[05:11:45] PROBLEM - HTTP availability for Nginx -SSL terminators- on einsteinium is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:12:56] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[05:13:05] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[05:13:45] RECOVERY - HTTP availability for Nginx -SSL terminators- on einsteinium is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[05:20:56] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[05:21:05] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[06:11:25] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is 
CRITICAL: CRITICAL - failed 26 probes of 299 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[06:26:25] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 13 probes of 299 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[06:51:35] (PS1) Elukey: role::druid::public::worker: prep work before upgrade to Druid 0.10 [puppet] - https://gerrit.wikimedia.org/r/430296 (https://phabricator.wikimedia.org/T164008)
[07:26:22] (PS1) Elukey: role::druid::analytics::worker: upgrade zookeeper to 3.4.9 [puppet] - https://gerrit.wikimedia.org/r/430298 (https://phabricator.wikimedia.org/T164008)
[07:27:40] !log restart db1098 for upgrade and validation
[07:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:54] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler02/11096/druid1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/430298 (https://phabricator.wikimedia.org/T164008) (owner: Elukey)
[07:31:48] !log upgrade zookeeper on druid100[1-3] to 3.4.9 - T164008
[07:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:52] T164008: Update druid to 0.10 - https://phabricator.wikimedia.org/T164008
[07:31:56] (CR) Elukey: [C: 2] role::druid::analytics::worker: upgrade zookeeper to 3.4.9 [puppet] - https://gerrit.wikimedia.org/r/430298 (https://phabricator.wikimedia.org/T164008) (owner: Elukey)
[07:33:16] Operations, Icinga, monitoring, Patch-For-Review, Wikimedia-Incident: Icinga: page in case all MediaWiki are throwing 5xx - https://phabricator.wikimedia.org/T186069#4173634 (fgiunchedi) There has been a spike of 500s yesterday in codfw, looks like from `search.wikimedia.org` (tracked at T193... 
[07:36:04] !log elasticsearch eqiad rolling restart for plugin update and NUMA config - T191543 / T191236
[07:36:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:09] T191543: Deploy updated search/extra plugin and search/extra-analysis-slovak plugin with Slovak Stemmer - https://phabricator.wikimedia.org/T191543
[07:36:09] T191236: Resolve elasticsearch latency alerts - https://phabricator.wikimedia.org/T191236
[07:38:38] mw on mw2174 has been spewing memcache errors for some hours now, not pooled though, I'm assuming after reimage
[07:42:28] !log remove openjdk-7 related packages from druid100[1-3] after zookeeper upgrade
[07:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:46] (PS3) Jcrespo: Revert "db1098.yaml: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/430032 (owner: Marostegui)
[07:47:07] (CR) Jcrespo: [C: 2] "I am going to enable notifications, but not pool it yet until a pass of compare.py" [puppet] - https://gerrit.wikimedia.org/r/430032 (owner: Marostegui)
[07:48:18] Operations, Discovery-Search (Current work): search.wikimedia.org is source of lots of 500s - https://phabricator.wikimedia.org/T179266#4173662 (Gehel)
[07:49:45] RECOVERY - Check systemd state on mw2174 is OK: OK - running: The system is fully operational
[08:05:05] (PS2) Elukey: role::druid::public::worker: prep work before upgrade to Druid 0.10 [puppet] - https://gerrit.wikimedia.org/r/430296 (https://phabricator.wikimedia.org/T164008)
[08:05:41] (CR) Elukey: [C: 2] role::druid::public::worker: prep work before upgrade to Druid 0.10 [puppet] - https://gerrit.wikimedia.org/r/430296 (https://phabricator.wikimedia.org/T164008) (owner: Elukey)
[08:08:53] (PS3) Volans: wmf-auto-reimage: verify BIOS boot parameters [puppet] - https://gerrit.wikimedia.org/r/429229
[08:08:55] (PS3) Volans: wmf-auto-reimage: allow to mask systemd services [puppet] - 
https://gerrit.wikimedia.org/r/429230
[08:08:57] (PS2) Volans: wmf-auto-reimage: increase timeout for Puppet cert [puppet] - https://gerrit.wikimedia.org/r/429738
[08:10:36] (CR) Ema: [C: 2] 5.1.3-1wm8: add patches included in 4.1.10 [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/429839 (https://phabricator.wikimedia.org/T192368) (owner: Ema)
[08:11:25] !log upgrading Druid to 0.10 on druid100[4-6] (wikistats 2 backend) - T164008
[08:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:29] T164008: Update druid to 0.10 - https://phabricator.wikimedia.org/T164008
[08:12:14] (CR) Vgutierrez: [C: 1] wmfusercontent.org: add SPF record to disable email [dns] - https://gerrit.wikimedia.org/r/430008 (https://phabricator.wikimedia.org/T193408) (owner: Dzahn)
[08:13:26] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 58 probes of 299 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[08:22:05] !log varnish 5.1.3-1wm8 uploaded to apt.w.o T192368
[08:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:09] T192368: Unconditional return(deliver) in vcl_hit - https://phabricator.wikimedia.org/T192368
[08:25:36] (PS4) Volans: wmf-auto-reimage: verify BIOS boot parameters [puppet] - https://gerrit.wikimedia.org/r/429229
[08:25:42] (PS4) Volans: wmf-auto-reimage: allow to mask systemd services [puppet] - https://gerrit.wikimedia.org/r/429230
[08:25:48] (PS3) Volans: wmf-auto-reimage: increase timeout for Puppet cert [puppet] - https://gerrit.wikimedia.org/r/429738
[08:26:31] (CR) Volans: [C: 2] wmf-auto-reimage: verify BIOS boot parameters [puppet] - https://gerrit.wikimedia.org/r/429229 (owner: Volans)
[08:26:44] (CR) Volans: [C: 2] wmf-auto-reimage: allow to mask systemd services [puppet] - https://gerrit.wikimedia.org/r/429230 (owner: Volans)
[08:27:00] (CR) Volans: [C: 
2] wmf-auto-reimage: increase timeout for Puppet cert [puppet] - https://gerrit.wikimedia.org/r/429738 (owner: Volans)
[08:28:13] (PS3) Filippo Giunchedi: k8s: simplify prometheus alerts with recording rules [puppet] - https://gerrit.wikimedia.org/r/429416 (https://phabricator.wikimedia.org/T193186)
[08:29:16] Operations, CirrusSearch, Discovery, Discovery-Search, Elasticsearch: Alert when elasticsearch writes are frozen for too long - https://phabricator.wikimedia.org/T193605#4173733 (Gehel)
[08:29:19] (CR) Filippo Giunchedi: [C: 2] k8s: simplify prometheus alerts with recording rules [puppet] - https://gerrit.wikimedia.org/r/429416 (https://phabricator.wikimedia.org/T193186) (owner: Filippo Giunchedi)
[08:35:31] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 14 probes of 299 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[08:36:02] !log reimaging mw1228, mw1229, mw1230 to stretch (those were logged to SAL before, but failed with IPMI issues before)
[08:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:16] (PS1) Filippo Giunchedi: kubernetes: allow NaN for Prometheus alerts [puppet] - https://gerrit.wikimedia.org/r/430299 (https://phabricator.wikimedia.org/T193186)
[08:44:11] (CR) Filippo Giunchedi: [C: 2] kubernetes: allow NaN for Prometheus alerts [puppet] - https://gerrit.wikimedia.org/r/430299 (https://phabricator.wikimedia.org/T193186) (owner: Filippo Giunchedi)
[08:44:20] (PS1) Vgutierrez: install_server: Reimage lvs2003 as stretch [puppet] - https://gerrit.wikimedia.org/r/430300 (https://phabricator.wikimedia.org/T191897)
[08:53:53] (CR) Vgutierrez: [C: 2] install_server: Reimage lvs2003 as stretch [puppet] - https://gerrit.wikimedia.org/r/430300 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez)
[08:54:15] !log reimaging mw1250, mw1254, mw1255 (app servers) to stretch
[08:54:18] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:05] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[09:02:31] (PS2) Vgutierrez: install_server: Reimage lvs2003 as stretch [puppet] - https://gerrit.wikimedia.org/r/430300 (https://phabricator.wikimedia.org/T191897)
[09:03:15] PROBLEM - Request latencies on argon is CRITICAL: (null) https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:03:15] PROBLEM - Request latencies on chlorine is CRITICAL: (null) https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:03:24] PROBLEM - Request latencies on acrab is CRITICAL: (null) https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:03:34] PROBLEM - Request latencies on neon is CRITICAL: (null) https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:04:22] godog: related to your merge? ^^^
[09:04:37] hopefully :)
[09:04:52] volans: likely, I'll take a look
[09:05:09] I don't see apparent issues on the graphs
[09:08:14] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[09:08:52] (PS1) Ema: cache_text vtc: send Host header [puppet] - https://gerrit.wikimedia.org/r/430303
[09:10:05] !log Depool lvs2003 and reimage as stretch - T191897
[09:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:10] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897
[09:10:19] (CR) Ema: [C: 2] cache_text vtc: send Host header [puppet] - https://gerrit.wikimedia.org/r/430303 (owner: Ema)
[09:20:31] PROBLEM - mediawiki-installation DSH group on mw1228 is CRITICAL: Host mw1228 is not in mediawiki-installation dsh group
[09:20:31] PROBLEM - mediawiki-installation DSH group on mw1229 is CRITICAL: Host mw1229 is not in mediawiki-installation dsh group
[09:20:31] PROBLEM - HHVM 
processes on mw1228 is CRITICAL: Return code of 255 is out of bounds
[09:20:31] PROBLEM - HHVM processes on mw1229 is CRITICAL: Return code of 255 is out of bounds
[09:20:31] PROBLEM - HHVM processes on mw1230 is CRITICAL: Return code of 255 is out of bounds
[09:20:31] PROBLEM - mediawiki-installation DSH group on mw1230 is CRITICAL: Host mw1230 is not in mediawiki-installation dsh group
[09:21:04] (PS1) Gilles: Fix python3 support for python-logstash [debs/python-logstash] - https://gerrit.wikimedia.org/r/430305 (https://phabricator.wikimedia.org/T193488)
[09:21:08] (PS1) Gilles: Add .gitreview file [debs/python-logstash] - https://gerrit.wikimedia.org/r/430306
[09:21:21] PROBLEM - Request latencies on acrux is CRITICAL: (null) https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:21:51] PROBLEM - High CPU load on API appserver on mw1228 is CRITICAL: Return code of 255 is out of bounds
[09:21:51] PROBLEM - High CPU load on API appserver on mw1229 is CRITICAL: Return code of 255 is out of bounds
[09:21:51] PROBLEM - High CPU load on API appserver on mw1230 is CRITICAL: Return code of 255 is out of bounds
[09:22:10] PROBLEM - HHVM rendering on mw1228 is CRITICAL: connect to address 10.64.48.63 and port 80: Connection refused
[09:22:11] PROBLEM - HHVM rendering on mw1229 is CRITICAL: connect to address 10.64.48.64 and port 80: Connection refused
[09:22:11] PROBLEM - HHVM rendering on mw1230 is CRITICAL: connect to address 10.64.48.65 and port 80: Connection refused
[09:22:11] PROBLEM - nutcracker port on mw1228 is CRITICAL: Return code of 255 is out of bounds
[09:22:11] PROBLEM - nutcracker port on mw1229 is CRITICAL: Return code of 255 is out of bounds
[09:22:11] PROBLEM - nutcracker port on mw1230 is CRITICAL: Return code of 255 is out of bounds
[09:23:06] ^reimage spam, silencing
[09:29:06] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4173920 
(ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs2003.codfw.wmnet ``` The log can be found in `/var/lo...
[09:32:35] (PS1) Elukey: profile::prometheus::alerts: add druid alerts for available segments [puppet] - https://gerrit.wikimedia.org/r/430312 (https://phabricator.wikimedia.org/T164008)
[09:33:31] (CR) Elukey: [C: 2] profile::prometheus::alerts: add druid alerts for available segments [puppet] - https://gerrit.wikimedia.org/r/430312 (https://phabricator.wikimedia.org/T164008) (owner: Elukey)
[09:35:20] (PS2) Ema: varnishmedia: remove python daemon [puppet] - https://gerrit.wikimedia.org/r/429833 (https://phabricator.wikimedia.org/T184942)
[09:38:23] Operations, Performance-Team, Traffic: Update Media dashboard in Grafana to use Prometheus metrics - https://phabricator.wikimedia.org/T193445#4173951 (ema) p:Triage>Normal
[09:39:52] PROBLEM - HHVM rendering on mw1250 is CRITICAL: connect to address 10.64.48.85 and port 80: Connection refused
[09:39:52] PROBLEM - Apache HTTP on mw1255 is CRITICAL: connect to address 10.64.48.90 and port 80: Connection refused
[09:39:52] PROBLEM - nutcracker process on mw1250 is CRITICAL: Return code of 255 is out of bounds
[09:39:52] PROBLEM - MD RAID on mw1255 is CRITICAL: Return code of 255 is out of bounds
[09:41:32] PROBLEM - Apache HTTP on mw1254 is CRITICAL: connect to address 10.64.48.89 and port 80: Connection refused
[09:41:32] PROBLEM - puppet last run on mw1250 is CRITICAL: Return code of 255 is out of bounds
[09:41:33] PROBLEM - MD RAID on mw1254 is CRITICAL: Return code of 255 is out of bounds
[09:41:33] PROBLEM - Check size of conntrack table on mw1255 is CRITICAL: Return code of 255 is out of bounds
[09:42:16] (CR) Ema: "> Find out :) - https://gist.github.com/Krinkle/b5ceff5156c1f4cf3568e373cc135bad" [puppet] - https://gerrit.wikimedia.org/r/429833 (https://phabricator.wikimedia.org/T184942) (owner: Ema) 
[09:45:32] (PS1) Jcrespo: mariadb: Setup db1121 into s4 section [puppet] - https://gerrit.wikimedia.org/r/430314 (https://phabricator.wikimedia.org/T192979)
[09:47:50] RECOVERY - Request latencies on acrab is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:50:55] (PS1) Filippo Giunchedi: kubernetes: escape exclamation mark in apiserver latency check [puppet] - https://gerrit.wikimedia.org/r/430315 (https://phabricator.wikimedia.org/T193186)
[09:51:27] (CR) jerkins-bot: [V: -1] kubernetes: escape exclamation mark in apiserver latency check [puppet] - https://gerrit.wikimedia.org/r/430315 (https://phabricator.wikimedia.org/T193186) (owner: Filippo Giunchedi)
[09:52:53] (PS2) Filippo Giunchedi: kubernetes: escape exclamation mark in apiserver latency check [puppet] - https://gerrit.wikimedia.org/r/430315 (https://phabricator.wikimedia.org/T193186)
[09:53:45] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4173979 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs2003.codfw.wmnet'] ``` and were **ALL** successful. 
[09:55:10] Operations, Puppet, Release-Engineering-Team, puppet-compiler, Continuous-Integration-Config: Figure out a way to enable volunteers to use the puppet compiler - https://phabricator.wikimedia.org/T192532#4173980 (Paladox)
[09:55:20] (PS1) Jcrespo: mariadb: Depool db1064 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/430317 (https://phabricator.wikimedia.org/T192979)
[09:55:37] (PS1) Elukey: role::druid::analytics::worker: enable new Druid SQL feature [puppet] - https://gerrit.wikimedia.org/r/430318 (https://phabricator.wikimedia.org/T164008)
[09:58:33] (CR) Jcrespo: [C: 2] mariadb: Depool db1064 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/430317 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo)
[09:59:48] (Merged) jenkins-bot: mariadb: Depool db1064 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/430317 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo)
[10:00:05] (CR) jenkins-bot: mariadb: Depool db1064 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/430317 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo)
[10:00:43] (CR) Filippo Giunchedi: [C: 2] kubernetes: escape exclamation mark in apiserver latency check [puppet] - https://gerrit.wikimedia.org/r/430315 (https://phabricator.wikimedia.org/T193186) (owner: Filippo Giunchedi)
[10:06:01] RECOVERY - Request latencies on argon is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[10:07:03] (PS1) Vgutierrez: pybal: Re-enable BGP in lvs2003 [puppet] - https://gerrit.wikimedia.org/r/430321 (https://phabricator.wikimedia.org/T191897)
[10:07:54] (CR) Vgutierrez: [C: 2] pybal: Re-enable BGP in lvs2003 [puppet] - https://gerrit.wikimedia.org/r/430321 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez)
[10:08:38] (PS2) Ladsgroup: mediawiki: Add 
clearTermSqlIndexSearchFields for wikidata [puppet] - https://gerrit.wikimedia.org/r/427202 (https://phabricator.wikimedia.org/T189779)
[10:17:18] !log Repool lvs2003 - T191897
[10:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:23] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897
[10:18:37] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4174051 (Vgutierrez)
[10:19:08] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4138499 (Vgutierrez)
[10:19:11] Operations, Pybal, Traffic, netops: Rename lvs* LLDP port descriptions after upgrading to stretch - https://phabricator.wikimedia.org/T192087#4174052 (Vgutierrez) Resolved>Open
[10:20:11] (CR) Muehlenhoff: [C: 1] "That's okay. We use kernel-level hardening to prevent symlink attacks and while PrivateTmp also addresses some other issues, it's okay to " [puppet] - https://gerrit.wikimedia.org/r/430049 (owner: Gehel)
[10:21:42] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1064 (duration: 01m 17s)
[10:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:53] jouncebot: next
[10:22:53] In 2 hour(s) and 37 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180502T1300)
[10:25:26] (PS1) Vgutierrez: install_server: Reimage lvs2002 as stretch [puppet] - https://gerrit.wikimedia.org/r/430323 (https://phabricator.wikimedia.org/T191897)
[10:25:50] !log Depool and reimage lvs2002 as stretch - T191897
[10:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:54] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897
[10:26:35] (CR) Jcrespo: [C: 2] mariadb: Setup db1121 into s4 section 
[puppet] - https://gerrit.wikimedia.org/r/430314 (https://phabricator.wikimedia.org/T192979) (owner: Jcrespo)
[10:26:41] (PS2) Jcrespo: mariadb: Setup db1121 into s4 section [puppet] - https://gerrit.wikimedia.org/r/430314 (https://phabricator.wikimedia.org/T192979)
[10:28:26] (CR) Vgutierrez: [C: 2] install_server: Reimage lvs2002 as stretch [puppet] - https://gerrit.wikimedia.org/r/430323 (https://phabricator.wikimedia.org/T191897) (owner: Vgutierrez)
[10:28:46] (PS3) Jcrespo: mariadb: Setup db1121 into s4 section [puppet] - https://gerrit.wikimedia.org/r/430314 (https://phabricator.wikimedia.org/T192979)
[10:29:12] RECOVERY - Request latencies on chlorine is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[10:29:22] RECOVERY - Request latencies on neon is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[10:29:41] PROBLEM - PyBal backends health check on lvs2002 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090
[10:29:55] ^^ that's me reimaging
[10:30:02] PROBLEM - PyBal connections to etcd on lvs2002 is CRITICAL: CRITICAL: 0 connections established with conf2001.codfw.wmnet:2379 (min=14)
[10:30:02] (CR) Filippo Giunchedi: [C: 2] Fix python3 support for python-logstash [debs/python-logstash] - https://gerrit.wikimedia.org/r/430305 (https://phabricator.wikimedia.org/T193488) (owner: Gilles)
[10:30:21] PROBLEM - pybal on lvs2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal
[10:30:46] (actually just depooling)
[10:30:50] !log installing openjdk-8 security updates on stat hosts
[10:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:57] (PS3) MarcoAurelio: euwikisource: add Author namespace, add English alias as well [mediawiki-config] - https://gerrit.wikimedia.org/r/429400 (https://phabricator.wikimedia.org/T193225) 
[10:38:37] (PS14) Volans: Add basic test coverage [software/debmonitor] - https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504)
[10:38:39] (PS10) Volans: Add login and LDAP support [software/debmonitor] - https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504)
[10:38:41] (PS5) Volans: Add server side validation of client certificates [software/debmonitor] - https://gerrit.wikimedia.org/r/428302 (https://phabricator.wikimedia.org/T167504)
[10:38:59] (CR) jerkins-bot: [V: -1] Add basic test coverage [software/debmonitor] - https://gerrit.wikimedia.org/r/394621 (https://phabricator.wikimedia.org/T167504) (owner: Volans)
[10:39:01] (CR) jerkins-bot: [V: -1] Add login and LDAP support [software/debmonitor] - https://gerrit.wikimedia.org/r/425417 (https://phabricator.wikimedia.org/T167504) (owner: Volans)
[10:39:04] (CR) jerkins-bot: [V: -1] Add server side validation of client certificates [software/debmonitor] - https://gerrit.wikimedia.org/r/428302 (https://phabricator.wikimedia.org/T167504) (owner: Volans)
[10:39:06] (PS3) Ladsgroup: mediawiki: Add clearTermSqlIndexSearchFields for wikidata [puppet] - https://gerrit.wikimedia.org/r/427202 (https://phabricator.wikimedia.org/T189779)
[10:39:23] Operations, ops-eqiad, DBA, Patch-For-Review: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4174122 (jcrespo) s6 main_tables.txt have been checked, no errors found, now checking s7 instance: ``` $ cat s7.dblist | while read db; do cat main_tables.txt | while read t...
[10:39:24] (CR) Volans: "Comment addressed, see inline." 
(2 comments) [software/debmonitor] - https://gerrit.wikimedia.org/r/428302 (https://phabricator.wikimedia.org/T167504) (owner: Volans)
[10:39:32] (CR) jerkins-bot: [V: -1] mediawiki: Add clearTermSqlIndexSearchFields for wikidata [puppet] - https://gerrit.wikimedia.org/r/427202 (https://phabricator.wikimedia.org/T189779) (owner: Ladsgroup)
[10:40:32] (PS4) Ladsgroup: mediawiki: Add clearTermSqlIndexSearchFields for wikidata [puppet] - https://gerrit.wikimedia.org/r/427202 (https://phabricator.wikimedia.org/T189779)
[10:42:03] Operations, ops-eqiad, Traffic, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4174159 (Vgutierrez) >>! In T184293#4162375, @Cmjohnson wrote: > @ayounsi Can you create a subnet for LVS for row D please. According to https://wikitech.wikimedia.org/wi...
[10:43:04] RECOVERY - Request latencies on acrux is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[10:45:58] Operations, Pybal, Traffic, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4174181 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs2002.codfw.wmnet ``` The log can be found in `/var/lo...
[10:48:57] !log reimaging mw2200 to stretch
[10:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:30] "Request from 88.97.96.89 via cp1063 cp1063, Varnish XID 121726507
[10:54:30] Error: 429, Too Many Requests at Wed, 02 May 2018 10:54:06 GMT"
[10:54:43] Server issues? 
[10:55:32] 429 is Too Many Requests, literally
[10:55:33] (PS1) Mark Bergsma: Create MonitoringProtocolTestCase base class [debs/pybal] - https://gerrit.wikimedia.org/r/430337
[10:55:35] (PS1) Mark Bergsma: Add minimal test cases for Skeleton and ProxyFetch [debs/pybal] - https://gerrit.wikimedia.org/r/430338
[10:55:37] (PS1) Mark Bergsma: Add unit tests for ProxyFetchMonitoringProtocol [debs/pybal] - https://gerrit.wikimedia.org/r/430339
[10:55:51] server is asking to throttle because requests are coming too fast
[10:56:07] !log kartik@tin Started deploy [cxserver/deploy@0aa3532]: Update cxserver to a20bf75
[10:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:12] (only to the client it responds to)
[10:56:37] (CR) jerkins-bot: [V: -1] Add unit tests for ProxyFetchMonitoringProtocol [debs/pybal] - https://gerrit.wikimedia.org/r/430339 (owner: Mark Bergsma)
[11:02:08] !log kartik@tin Finished deploy [cxserver/deploy@0aa3532]: Update cxserver to a20bf75 (duration: 06m 01s)
[11:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:02:39] (CR) Mobrovac: [C: -1] "Let's also switch MassMessage/MassMessageSubmitJob and SecurePoll/PopulateVoterListJob for everything (both here and in the CP4JQ patch)" [mediawiki-config] - https://gerrit.wikimedia.org/r/429980 (https://phabricator.wikimedia.org/T190327) (owner: Ppchelko)
[11:06:02] (PS2) Mark Bergsma: Add unit tests for ProxyFetchMonitoringProtocol [debs/pybal] - https://gerrit.wikimedia.org/r/430339
[11:06:56] (CR) jerkins-bot: [V: -1] Add unit tests for ProxyFetchMonitoringProtocol [debs/pybal] - https://gerrit.wikimedia.org/r/430339 (owner: Mark Bergsma)
[11:12:31] !log stopping db1064 for cloning to db1121 (will create temporary lag on commons wikireplicas)
[11:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:23:32] PROBLEM - Check size of conntrack table on 
mw1229 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:23:33] PROBLEM - Check whether ferm is active by checking the default input chain on mw1229 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:23:33] PROBLEM - nutcracker process on mw1229 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:23:42] PROBLEM - configured eth on mw1228 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:23:42] PROBLEM - Check size of conntrack table on mw1230 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:23:43] PROBLEM - dhclient process on mw1228 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:23:43] PROBLEM - Disk space on mw1228 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:23:43] PROBLEM - Disk space on mw1229 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:23:43] PROBLEM - dhclient process on mw1229 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:23:43] PROBLEM - Disk space on mw1230 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:23:43] PROBLEM - dhclient process on mw1230 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:23:43] PROBLEM - configured eth on mw1229 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:23:44] PROBLEM - Nginx local proxy to apache on mw1229 is CRITICAL: connect to address 10.64.48.64 and port 443: Connection refused
[11:23:44] PROBLEM - DPKG on mw1230 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:23:52] PROBLEM - DPKG on mw1228 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[11:23:53] PROBLEM - Nginx local proxy to apache on mw1228 is CRITICAL: connect to address 10.64.48.63 and port 443: Connection refused
[11:24:02] PROBLEM - Apache HTTP on mw1230 is CRITICAL: connect to address 10.64.48.65 and port 80: Connection refused
[11:24:02] PROBLEM - MD RAID on mw1228 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:24:02] PROBLEM - Check systemd state on mw1230 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:24:02] PROBLEM - MD RAID on mw1230 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:24:02] PROBLEM - Nginx local proxy to apache on mw1230 is CRITICAL: connect to address 10.64.48.65 and port 443: Connection refused
[11:24:03] PROBLEM - Check systemd state on mw1228 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:24:03] PROBLEM - nutcracker process on mw1230 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:24:06] hmmm moritzm reimaging?
[11:24:10] reimage yeah
[11:24:12] PROBLEM - DPKG on mw1229 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:24:13] PROBLEM - Check whether ferm is active by checking the default input chain on mw1228 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:24:19] yep, silencing
[11:24:22] PROBLEM - nutcracker process on mw1228 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:24:22] PROBLEM - MD RAID on mw1229 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:24:22] PROBLEM - Check whether ferm is active by checking the default input chain on mw1230 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:24:23] PROBLEM - configured eth on mw1230 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[11:24:23] PROBLEM - Check size of conntrack table on mw1228 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[11:24:23] PROBLEM - Check systemd state on mw1229 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [11:24:53] PROBLEM - MariaDB Slave Lag: s4 on db1102 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 630.99 seconds [11:26:22] RECOVERY - Apache HTTP on mw1254 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.002 second response time [11:27:12] RECOVERY - Apache HTTP on mw1255 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.001 second response time [11:29:23] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4174284 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs2002.codfw.wmnet'] ``` and were **ALL** successful. [11:35:09] (03PS1) 10Vgutierrez: pybal: Re-enable BGP in lvs2002 [puppet] - 10https://gerrit.wikimedia.org/r/430346 (https://phabricator.wikimedia.org/T191897) [11:36:45] (03CR) 10Vgutierrez: [C: 032] pybal: Re-enable BGP in lvs2002 [puppet] - 10https://gerrit.wikimedia.org/r/430346 (https://phabricator.wikimedia.org/T191897) (owner: 10Vgutierrez) [11:41:09] !log Repool lvs2002 - T191897 [11:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:13] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897 [11:43:25] ACKNOWLEDGEMENT - MariaDB Slave Lag: s4 on db1102 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1718.09 seconds Jcrespo ongoing maintenance on its master [11:44:53] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4174322 (10Vgutierrez) [11:46:22] 10Operations, 10Pybal, 10Traffic, 10netops: Rename lvs* LLDP port descriptions after upgrading to stretch - https://phabricator.wikimedia.org/T192087#4174328 (10Vgutierrez) [11:46:23] (03PS1) 10Volans: wmf-auto-reimage: increase first Puppet run timeout [puppet] - 10https://gerrit.wikimedia.org/r/430351 [11:47:26] (03CR) 10Muehlenhoff: 
[C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/430351 (owner: 10Volans) [11:47:51] (03PS2) 10Volans: wmf-auto-reimage: increase first Puppet run timeout [puppet] - 10https://gerrit.wikimedia.org/r/430351 [11:48:40] (03CR) 10Volans: [C: 032] wmf-auto-reimage: increase first Puppet run timeout [puppet] - 10https://gerrit.wikimedia.org/r/430351 (owner: 10Volans) [11:49:32] RECOVERY - Check whether ferm is active by checking the default input chain on mw1228 is OK: OK ferm input default policy is set [11:49:32] RECOVERY - MD RAID on mw1229 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [11:49:42] RECOVERY - HHVM processes on mw1229 is OK: PROCS OK: 1 process with command name hhvm [11:49:42] RECOVERY - HHVM processes on mw1228 is OK: PROCS OK: 1 process with command name hhvm [11:49:42] RECOVERY - Check size of conntrack table on mw1228 is OK: OK: nf_conntrack is 0 % full [11:49:43] RECOVERY - Check size of conntrack table on mw1229 is OK: OK: nf_conntrack is 0 % full [11:49:43] RECOVERY - High CPU load on API appserver on mw1229 is OK: OK - load average: 8.39, 5.17, 2.42 [11:49:43] RECOVERY - Check whether ferm is active by checking the default input chain on mw1229 is OK: OK ferm input default policy is set [11:49:52] RECOVERY - configured eth on mw1228 is OK: OK - interfaces up [11:49:53] RECOVERY - Disk space on mw1229 is OK: DISK OK [11:49:53] RECOVERY - Disk space on mw1228 is OK: DISK OK [11:49:53] RECOVERY - dhclient process on mw1228 is OK: PROCS OK: 0 processes with command name dhclient [11:49:53] RECOVERY - dhclient process on mw1229 is OK: PROCS OK: 0 processes with command name dhclient [11:50:02] RECOVERY - configured eth on mw1229 is OK: OK - interfaces up [11:50:03] RECOVERY - DPKG on mw1228 is OK: All packages OK [11:50:13] RECOVERY - MD RAID on mw1228 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [11:50:22] RECOVERY - High CPU load on API appserver on mw1228 is OK: OK - load average: 5.51, 5.10, 2.55 [11:50:22] 
RECOVERY - DPKG on mw1229 is OK: All packages OK [11:56:02] RECOVERY - Nginx local proxy to apache on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 2.456 second response time [11:59:43] PROBLEM - High CPU load on API appserver on mw2200 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:00:13] RECOVERY - Apache HTTP on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.001 second response time [12:00:23] RECOVERY - Nginx local proxy to apache on mw1228 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 2.231 second response time [12:01:43] PROBLEM - nutcracker process on mw2200 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:03:24] PROBLEM - puppet last run on mw2200 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:04:14] RECOVERY - nutcracker port on mw1229 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [12:04:23] RECOVERY - Check systemd state on mw1229 is OK: OK - running: The system is fully operational [12:05:04] PROBLEM - Apache HTTP on mw2200 is CRITICAL: connect to address 10.192.32.88 and port 80: Connection refused [12:06:33] RECOVERY - HHVM rendering on mw1229 is OK: HTTP OK: HTTP/1.1 200 OK - 75415 bytes in 0.144 second response time [12:06:53] PROBLEM - Check size of conntrack table on mw2200 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:06:53] PROBLEM - MD RAID on mw2200 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:07:33] RECOVERY - Check systemd state on mw1228 is OK: OK - unknown: The operational state could not be determined, due to lack of resources or another error cause. [12:08:33] PROBLEM - Check systemd state on mw2200 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[12:09:44] RECOVERY - nutcracker process on mw1228 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [12:09:54] RECOVERY - nutcracker process on mw1229 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [12:10:04] RECOVERY - nutcracker port on mw1228 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [12:10:13] (03PS1) 10Vgutierrez: install_server: Reimage lvs2001 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/430354 (https://phabricator.wikimedia.org/T191897) [12:10:14] PROBLEM - Nginx local proxy to apache on mw2200 is CRITICAL: connect to address 10.192.32.88 and port 443: Connection refused [12:10:14] PROBLEM - Check the NTP synchronisation status of timesyncd on mw2200 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:11:30] ^ mw2200 is a reimage [12:12:03] PROBLEM - Check whether ferm is active by checking the default input chain on mw2200 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:12:54] RECOVERY - HHVM rendering on mw1228 is OK: HTTP OK: HTTP/1.1 200 OK - 75417 bytes in 1.452 second response time [12:13:43] PROBLEM - DPKG on mw2200 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:13:43] PROBLEM - configured eth on mw2200 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:15:24] PROBLEM - Disk space on mw2200 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:15:24] PROBLEM - dhclient process on mw2200 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[12:16:26] 10Operations, 10Beta-Cluster-Infrastructure, 10User-Addshore, 10User-Joe: Run mediawiki::maintenance scripts in Beta Cluster - https://phabricator.wikimedia.org/T125976#4174418 (10Joe) [12:17:48] !log Depool and reimage lvs2001 as stretch - T191897 [12:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:53] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897 [12:20:33] RECOVERY - mediawiki-installation DSH group on mw1228 is OK: OK [12:20:33] RECOVERY - mediawiki-installation DSH group on mw1229 is OK: OK [12:21:53] (03CR) 10Vgutierrez: [C: 032] install_server: Reimage lvs2001 as stretch [puppet] - 10https://gerrit.wikimedia.org/r/430354 (https://phabricator.wikimedia.org/T191897) (owner: 10Vgutierrez) [12:27:12] (03PS4) 10Filippo Giunchedi: base: alert on SMART health failure [puppet] - 10https://gerrit.wikimedia.org/r/427654 (https://phabricator.wikimedia.org/T86552) [12:29:49] (03CR) 10Filippo Giunchedi: [C: 032] base: alert on SMART health failure [puppet] - 10https://gerrit.wikimedia.org/r/427654 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [12:30:17] expect some critical coming in related to smart [12:30:34] yup [12:31:24] RECOVERY - MD RAID on mw1230 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [12:32:18] RECOVERY - Disk space on mw1230 is OK: DISK OK [12:32:18] RECOVERY - Check size of conntrack table on mw1230 is OK: OK: nf_conntrack is 0 % full [12:32:18] RECOVERY - dhclient process on mw1230 is OK: PROCS OK: 0 processes with command name dhclient [12:32:19] RECOVERY - DPKG on mw1230 is OK: All packages OK [12:32:28] RECOVERY - High CPU load on API appserver on mw1230 is OK: OK - load average: 7.13, 12.37, 7.53 [12:34:38] RECOVERY - Nginx local proxy to apache on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 2.464 second response time [12:34:39] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 
76011 bytes in 5.379 second response time [12:42:42] RECOVERY - Check systemd state on mw1230 is OK: OK - running: The system is fully operational [12:42:52] RECOVERY - nutcracker process on mw1230 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [12:43:12] PROBLEM - Device not healthy -SMART- on tungsten is CRITICAL: cluster=misc device={megaraid,1,megaraid,8} instance=tungsten:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=tungsten&var-datasource=eqiad%2520prometheus%252Fops [12:44:12] RECOVERY - Check whether ferm is active by checking the default input chain on mw1230 is OK: OK ferm input default policy is set [12:45:41] RECOVERY - nutcracker port on mw1230 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [12:46:01] RECOVERY - HHVM processes on mw1230 is OK: PROCS OK: 6 processes with command name hhvm [12:48:29] !log reimaging mw1340, mw1341, mw1342 (API servers) to stretch [12:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:07] 08Warning Alert for device asw-d-codfw.mgmt.codfw.wmnet - Inbound interface errors [12:49:22] RECOVERY - configured eth on mw1230 is OK: OK - interfaces up [12:50:33] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4174470 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs2001.codfw.wmnet ``` The log can be found in `/var/lo... 
[12:51:35] (03Restored) 10Hoo man: Increase dispatching resources by about 50% [puppet] - 10https://gerrit.wikimedia.org/r/429662 (https://phabricator.wikimedia.org/T193349) (owner: 10Hoo man) [12:55:23] (03PS3) 10Hoo man: Increase dispatching resources by about 10% [puppet] - 10https://gerrit.wikimedia.org/r/429662 (https://phabricator.wikimedia.org/T193349) [12:56:42] (03CR) 10Ppchelko: "Quick search over the usages of the cross-wiki job posting feature shows that we have quite a few instances of jobs that must be switched " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429980 (https://phabricator.wikimedia.org/T190327) (owner: 10Ppchelko) [12:56:50] _joe_: Can you take a look at https://gerrit.wikimedia.org/r/429662 maybe? [12:57:14] <_joe_> hoo: I'm at a conference, you'd need to find someone else [12:58:41] * hoo eyes apergos and mutante … are you in for a quick merge? [12:59:09] Bit early for SF :P [12:59:25] true [12:59:29] I'm in the middle of comparing outputs, can it wait a while? [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180502T1300). [13:00:05] Gilles and Hauskatze: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:11] hai [13:00:18] It can, we're fine atm :) [13:00:21] yo [13:01:03] !log re-attempting to reimage mw1250, mw1254, mw1255 (app servers) to stretch, those ran into a timeout earlier which is now fixed in the reimage script [13:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:19] hoo: while you wait for your merge, you can assist gilles and me with swat :P (/me hides) [13:01:30] I can SWAT [13:02:24] Hauskatze: is it just that InitialiseSettings.php change that needs to be deployed? [13:02:49] Hauskatze: If there's anything for me, sure thing ;) [13:03:05] gilles: I'll need a script run afterwards [13:03:06] ACKNOWLEDGEMENT - Device not healthy -SMART- on wasat is CRITICAL: cluster=misc device=sda instance=wasat:9100 job=node site=codfw Filippo Giunchedi T193394 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=wasat&var-datasource=codfw%2520prometheus%252Fops [13:03:19] Hauskatze: maintenance script? [13:03:20] if you're familiar with namespaceDupes [13:03:23] yep [13:03:25] I'm not [13:03:49] * Hauskatze groans [13:03:49] do you have the syntax to invoke it? [13:03:57] yes I do [13:04:09] it can be run later though [13:04:15] PROBLEM - Device not healthy -SMART- on labsdb1005 is CRITICAL: cluster=mysql device=megaraid,8 instance=labsdb1005:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labsdb1005&var-datasource=eqiad%2520prometheus%252Fops [13:04:21] ok, is the change testable with just the config change? 
[13:04:28] on mwdebug machines [13:04:32] yes, testable on mwdebug [13:04:37] ok, let's go then [13:04:45] (03CR) 10Gilles: [C: 032] euwikisource: add Author namespace, add English alias as well [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429400 (https://phabricator.wikimedia.org/T193225) (owner: 10MarcoAurelio) [13:04:49] basically to check if the namespace appears, the alias works and the wiki doesn't break [13:05:28] !log Starting mid-day EU SWAT [13:05:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:45] PROBLEM - Device not healthy -SMART- on db1063 is CRITICAL: cluster=mysql device=megaraid,1 instance=db1063:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1063&var-datasource=eqiad%2520prometheus%252Fops [13:05:46] PROBLEM - Device not healthy -SMART- on labnet1002 is CRITICAL: cluster=labs device=sda instance=labnet1002:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labnet1002&var-datasource=eqiad%2520prometheus%252Fops [13:05:48] (03CR) 10Elukey: [C: 032] role::druid::analytics::worker: enable new Druid SQL feature [puppet] - 10https://gerrit.wikimedia.org/r/430318 (https://phabricator.wikimedia.org/T164008) (owner: 10Elukey) [13:05:53] (03PS2) 10Elukey: role::druid::analytics::worker: enable new Druid SQL feature [puppet] - 10https://gerrit.wikimedia.org/r/430318 (https://phabricator.wikimedia.org/T164008) [13:05:55] PROBLEM - Device not healthy -SMART- on labtestvirt2003 is CRITICAL: cluster=labtest device=nbd0 instance=labtestvirt2003:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labtestvirt2003&var-datasource=codfw%2520prometheus%252Fops [13:06:06] (03Merged) 10jenkins-bot: euwikisource: add Author namespace, add English alias as well [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429400 (https://phabricator.wikimedia.org/T193225) (owner: 10MarcoAurelio) 
[13:06:48] Hauskatze: your config change is deployed on mwdebug1002, please test [13:07:04] I'm on it. I see it in API and will further test [13:07:25] PROBLEM - Device not healthy -SMART- on db1051 is CRITICAL: cluster=mysql device=megaraid,8 instance=db1051:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1051&var-datasource=eqiad%2520prometheus%252Fops [13:07:54] gilles: looks good to me, the script run later will fix the namespace issues created though [13:09:05] PROBLEM - Device not healthy -SMART- on db1064 is CRITICAL: cluster=mysql device={megaraid,2,megaraid,6} instance=db1064:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1064&var-datasource=eqiad%2520prometheus%252Fops [13:09:06] 10Operations, 10ops-eqiad: tungsten disk 1 and 8 SMART failure - https://phabricator.wikimedia.org/T193628#4174494 (10fgiunchedi) [13:09:16] PROBLEM - Device not healthy -SMART- on snapshot1001 is CRITICAL: cluster=misc device=sda instance=snapshot1001:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=snapshot1001&var-datasource=eqiad%2520prometheus%252Fops [13:09:48] (03CR) 10jenkins-bot: euwikisource: add Author namespace, add English alias as well [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429400 (https://phabricator.wikimedia.org/T193225) (owner: 10MarcoAurelio) [13:09:56] Hauskatze: ok, I'm going to deploy the config change to prod now [13:10:06] gilles: ok thanks [13:10:47] I have the script syntax for after the deployment: mwscript namespaceDupes.php --wiki=euwikisource (this is a dry-run, please post the output on the Phab task) [13:10:56] ACKNOWLEDGEMENT - Device not healthy -SMART- on tungsten is CRITICAL: cluster=misc device={megaraid,1,megaraid,8} instance=tungsten:9100 job=node site=eqiad Filippo Giunchedi T193628 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=tungsten&var-datasource=eqiad%2520prometheus%252Fops [13:10:57] to be run on terbium [13:12:05] I'll do that right after [13:12:25] PROBLEM - Device not healthy -SMART- on bast3002 is CRITICAL: cluster=misc device=sdb instance=bast3002:9100 job=node site=esams https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=bast3002&var-datasource=esams%2520prometheus%252Fops [13:12:25] PROBLEM - Device not healthy -SMART- on db1055 is CRITICAL: cluster=mysql device={megaraid,1,megaraid,10,megaraid,4} instance=db1055:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1055&var-datasource=eqiad%2520prometheus%252Fops [13:13:26] (03Draft3) 10Bodhisattwa: Enable ULS webfonts by default at Bengali Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430360 (https://phabricator.wikimedia.org/T193367) [13:13:43] !log gilles@tin Synchronized wmf-config/InitialiseSettings.php: T193225 Add Author namespace on eu.wikisource (duration: 01m 20s) [13:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:47] T193225: Add Author namespace on eu.wikisource - https://phabricator.wikimedia.org/T193225 [13:14:29] Hauskatze: the task mentions use of specific parameters? [13:14:48] Hauskatze: I've posted the result of the dry run to the task [13:14:57] for the script? Nope. We usually do a first dry-run and then we run the same with --fix if there's anything to fix [13:15:04] let me check [13:15:05] ok [13:15:46] gilles: okay so the script can resolve all conflicts itself, please re-run with --fix at the end [13:15:55] !log T193225 mwscript namespaceDupes.php --wiki=euwikisource --fix [13:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:05] done [13:16:18] and now run again without --fix to check if there's anything left? 
[13:16:49] nothing left, pasted the output to the task [13:16:56] great, thanks :) [13:17:00] all good? [13:17:49] (03CR) 10Gilles: [C: 032] Add performance perception QuickSurvey definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421921 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [13:17:51] (03PS3) 10Mark Bergsma: Add unit tests for ProxyFetchMonitoringProtocol [debs/pybal] - 10https://gerrit.wikimedia.org/r/430339 [13:18:16] (03PS4) 10ArielGlenn: Increase dispatching resources by about 10% [puppet] - 10https://gerrit.wikimedia.org/r/429662 (https://phabricator.wikimedia.org/T193349) (owner: 10Hoo man) [13:18:18] gilles: yes all good as far as I can see [13:18:42] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4174538 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs2001.codfw.wmnet'] ``` and were **ALL** successful. [13:18:54] (03CR) 10ArielGlenn: [C: 032] Increase dispatching resources by about 10% [puppet] - 10https://gerrit.wikimedia.org/r/429662 (https://phabricator.wikimedia.org/T193349) (owner: 10Hoo man) [13:19:04] PROBLEM - Device not healthy -SMART- on dbstore1002 is CRITICAL: cluster=mysql device=megaraid,5 instance=dbstore1002:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=dbstore1002&var-datasource=eqiad%2520prometheus%252Fops [13:19:24] (03Abandoned) 10Bodhisattwa: Enable ULS webfonts by default at Bengali Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430360 (https://phabricator.wikimedia.org/T193367) (owner: 10Bodhisattwa) [13:19:41] (03Restored) 10Bodhisattwa: Enable ULS webfonts by default at Bengali Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430360 (https://phabricator.wikimedia.org/T193367) (owner: 10Bodhisattwa) [13:20:22] (03PS7) 10Gilles: Add performance perception QuickSurvey definition [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/421921 (https://phabricator.wikimedia.org/T187299) [13:20:34] RECOVERY - mediawiki-installation DSH group on mw1230 is OK: OK [13:20:39] (03CR) 10Gilles: [C: 032] Add performance perception QuickSurvey definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421921 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [13:20:44] (03CR) 10Volans: "Alternative approach inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/430079 (owner: 10Ottomata) [13:20:44] PROBLEM - Device not healthy -SMART- on db1066 is CRITICAL: cluster=mysql device=megaraid,6 instance=db1066:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1066&var-datasource=eqiad%2520prometheus%252Fops [13:20:52] !log restart druid broker on druid100[1-3] to enable the 'druid.sql.enable' feature [13:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:33] can someone revisit https://wikitech.wikimedia.org/wiki/Deployments#Wednesday,_May_02 [13:22:05] (03Merged) 10jenkins-bot: Add performance perception QuickSurvey definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421921 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [13:22:16] Hauskatze: Hello Marco [13:22:29] Hello, good morning :) [13:22:34] (03CR) 10jenkins-bot: Add performance perception QuickSurvey definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421921 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [13:22:52] Jayprakash12345: I'll try once I'm done with my stuff [13:23:24] ACKNOWLEDGEMENT - Device not healthy -SMART- on snapshot1001 is CRITICAL: cluster=misc device=sda instance=snapshot1001:9100 job=node site=eqiad Filippo Giunchedi To be decom T184616 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=snapshot1001&var-datasource=eqiad%2520prometheus%252Fops [13:24:06] 08̶W̶a̶r̶n̶i̶n̶g Device asw-d-codfw.mgmt.codfw.wmnet recovered from 
Inbound interface errors [13:24:15] PROBLEM - Device not healthy -SMART- on labstore1003 is CRITICAL: cluster=labsnfs device=megaraid,31 instance=labstore1003:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labstore1003&var-datasource=eqiad%2520prometheus%252Fops [13:24:45] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1051 is CRITICAL: cluster=mysql device=megaraid,8 instance=db1051:9100 job=node site=eqiad Filippo Giunchedi To be decom - T186320 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1051&var-datasource=eqiad%2520prometheus%252Fops [13:24:45] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1053 is CRITICAL: cluster=mysql device={megaraid,3,megaraid,8} instance=db1053:9100 job=node site=eqiad Filippo Giunchedi To be decom - T186320 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1053&var-datasource=eqiad%2520prometheus%252Fops [13:24:45] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1055 is CRITICAL: cluster=mysql device={megaraid,1,megaraid,10,megaraid,4} instance=db1055:9100 job=node site=eqiad Filippo Giunchedi To be decom - T186320 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1055&var-datasource=eqiad%2520prometheus%252Fops [13:25:15] PROBLEM - Device not healthy -SMART- on labsdb1004 is CRITICAL: cluster=mysql device=megaraid,6 instance=labsdb1004:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labsdb1004&var-datasource=eqiad%2520prometheus%252Fops [13:26:34] (03CR) 10Mark Bergsma: [C: 031] Create MonitoringProtocolTestCase base class [debs/pybal] - 10https://gerrit.wikimedia.org/r/430337 (owner: 10Mark Bergsma) [13:26:36] (03PS1) 10Elukey: role::druid::public::worker: upgrade zookeeper to 3.4.9 [puppet] - 10https://gerrit.wikimedia.org/r/430362 (https://phabricator.wikimedia.org/T164008) [13:26:51] (03PS2) 10Mark Bergsma: Create MonitoringProtocolTestCase base class [debs/pybal] - 
10https://gerrit.wikimedia.org/r/430337 [13:26:53] (03PS2) 10Mark Bergsma: Add minimal test cases for Skeleton and ProxyFetch [debs/pybal] - 10https://gerrit.wikimedia.org/r/430338 [13:26:55] (03PS4) 10Mark Bergsma: Add unit tests for ProxyFetchMonitoringProtocol [debs/pybal] - 10https://gerrit.wikimedia.org/r/430339 [13:27:25] ACKNOWLEDGEMENT - Device not healthy -SMART- on bast3002 is CRITICAL: cluster=misc device=sdb instance=bast3002:9100 job=node site=esams Filippo Giunchedi known - T169035 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=bast3002&var-datasource=esams%2520prometheus%252Fops [13:27:33] (03PS1) 10Gilles: Fix stray space in quicksurvey configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430363 (https://phabricator.wikimedia.org/T187299) [13:27:45] (03CR) 10Jayprakash12345: [C: 031] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430360 (https://phabricator.wikimedia.org/T193367) (owner: 10Bodhisattwa) [13:27:47] (03PS2) 10Gilles: Fix stray space in quicksurvey configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430363 (https://phabricator.wikimedia.org/T187299) [13:27:51] (03CR) 10jerkins-bot: [V: 04-1] Add minimal test cases for Skeleton and ProxyFetch [debs/pybal] - 10https://gerrit.wikimedia.org/r/430338 (owner: 10Mark Bergsma) [13:28:05] RECOVERY - Apache HTTP on mw2200 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.077 second response time [13:28:36] (03CR) 10Elukey: [C: 032] role::druid::public::worker: upgrade zookeeper to 3.4.9 [puppet] - 10https://gerrit.wikimedia.org/r/430362 (https://phabricator.wikimedia.org/T164008) (owner: 10Elukey) [13:29:03] !log upgrade zookeeper to 3.4.9 on druid100[4-6] (wikistats 2 backend) - T164008 [13:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:07] T164008: Update druid to 0.10 - https://phabricator.wikimedia.org/T164008 [13:29:11] (03PS3) 10Mark Bergsma: Create MonitoringProtocolTestCase base 
class [debs/pybal] - 10https://gerrit.wikimedia.org/r/430337 [13:29:13] (03PS3) 10Mark Bergsma: Add minimal test cases for Skeleton and ProxyFetch [debs/pybal] - 10https://gerrit.wikimedia.org/r/430338 [13:29:15] (03PS5) 10Mark Bergsma: Add unit tests for ProxyFetchMonitoringProtocol [debs/pybal] - 10https://gerrit.wikimedia.org/r/430339 [13:29:21] 10Operations, 10Beta-Cluster-Infrastructure, 10User-Addshore, 10User-Joe: Run mediawiki::maintenance scripts in Beta Cluster - https://phabricator.wikimedia.org/T125976#4174565 (10MarcoAurelio) @Joe I don't think anyone is working on this atm. Anyone should feel free to take on this one. [13:29:23] (03CR) 10Gilles: [C: 032] Fix stray space in quicksurvey configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430363 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [13:30:32] ACKNOWLEDGEMENT - Device not healthy -SMART- on tin is CRITICAL: cluster=misc device=sat+megaraid,0 instance=tin:9100 job=node site=eqiad Filippo Giunchedi known - T174449 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=tin&var-datasource=eqiad%2520prometheus%252Fops [13:30:35] (03Merged) 10jenkins-bot: Fix stray space in quicksurvey configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430363 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [13:30:37] (03CR) 10Mark Bergsma: [C: 031] Create MonitoringProtocolTestCase base class [debs/pybal] - 10https://gerrit.wikimedia.org/r/430337 (owner: 10Mark Bergsma) [13:31:59] (03CR) 10Mark Bergsma: [C: 031] Add minimal test cases for Skeleton and ProxyFetch [debs/pybal] - 10https://gerrit.wikimedia.org/r/430338 (owner: 10Mark Bergsma) [13:32:17] PROBLEM - Nginx local proxy to apache on mw1340 is CRITICAL: connect to address 10.64.32.52 and port 443: Connection refused [13:32:17] PROBLEM - Check whether ferm is active by checking the default input chain on mw1340 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[13:32:18] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1341 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:32:18] PROBLEM - Check systemd state on mw1342 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:32:18] PROBLEM - MD RAID on mw1342 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:32:25] (03PS1) 10Vgutierrez: pybal: Re-enable BGP in lvs2001 [puppet] - 10https://gerrit.wikimedia.org/r/430366 (https://phabricator.wikimedia.org/T191897) [13:33:57] PROBLEM - DPKG on mw1340 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:33:57] PROBLEM - Check whether ferm is active by checking the default input chain on mw1341 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:33:57] PROBLEM - Nginx local proxy to apache on mw1341 is CRITICAL: connect to address 10.64.32.53 and port 443: Connection refused [13:33:57] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1342 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[13:33:59] (03CR) 10Vgutierrez: [C: 032] pybal: Re-enable BGP in lvs2001 [puppet] - 10https://gerrit.wikimedia.org/r/430366 (https://phabricator.wikimedia.org/T191897) (owner: 10Vgutierrez) [13:34:05] (03PS2) 10Vgutierrez: pybal: Re-enable BGP in lvs2001 [puppet] - 10https://gerrit.wikimedia.org/r/430366 (https://phabricator.wikimedia.org/T191897) [13:35:07] PROBLEM - Device not healthy -SMART- on db1073 is CRITICAL: cluster=mysql device={megaraid,7,megaraid,9} instance=db1073:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1073&var-datasource=eqiad%2520prometheus%252Fops [13:35:23] (03PS1) 10Jcrespo: mariadb: Add new node db1121 to mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430368 (https://phabricator.wikimedia.org/T192979) [13:35:37] PROBLEM - Nginx local proxy to apache on mw1342 is CRITICAL: connect to address 10.64.32.54 and port 443: Connection refused [13:35:37] PROBLEM - Check whether ferm is active by checking the default input chain on mw1342 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:35:37] PROBLEM - configured eth on mw1340 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:35:37] PROBLEM - DPKG on mw1341 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [13:36:18] (03CR) 10Mark Bergsma: [C: 031] Add unit tests for ProxyFetchMonitoringProtocol [debs/pybal] - 10https://gerrit.wikimedia.org/r/430339 (owner: 10Mark Bergsma) [13:36:32] Hauskatze: Can you take https://gerrit.wikimedia.org/r/#/c/430039/ ? Pleaseee, I need to go for medicine. Please don't mind [13:36:48] (03PS1) 10Filippo Giunchedi: smart: ignore nbd devices [puppet] - 10https://gerrit.wikimedia.org/r/430369 [13:37:04] Jayprakash12345: okay, do I need to do anything special afterwards? testing? 
[13:37:12] !log gilles@tin Synchronized wmf-config/InitialiseSettings.php: T187299 Add performance perception QuickSurvey definition (duration: 01m 17s) [13:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:17] T187299: User-perceived page load performance study - https://phabricator.wikimedia.org/T187299 [13:37:39] (03PS2) 10Filippo Giunchedi: smart: ignore nbd devices [puppet] - 10https://gerrit.wikimedia.org/r/430369 [13:38:17] (03CR) 10Filippo Giunchedi: [C: 032] smart: ignore nbd devices [puppet] - 10https://gerrit.wikimedia.org/r/430369 (owner: 10Filippo Giunchedi) [13:38:28] RECOVERY - MariaDB Slave Lag: s4 on db1102 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [13:38:29] Hauskatze: No, Legoktm already merged master patch. only need to be here [13:38:36] seems like it changes something on Special:Unblock, but I don't have access to that on any prod wiki [13:39:07] I do [13:39:26] I've never deployed a backport, gotta check if I can make sense of the docs on that topic [13:39:48] can you please give me like 5 minutes? Someone has arrived here and I have to take care of him [13:40:13] !log Repool lvs2001 - T191897 [13:40:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:16] T191897: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897 [13:40:21] ok [13:40:41] 10Operations, 10Traffic, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Remove wildcard vhost for *.wikimedia.org - https://phabricator.wikimedia.org/T192206#4174594 (10BBlack) FWIW on my end, the following hostnames are definitely non-functional: ``` text-lb text-lb.codfw text-lb.eqiad text-lb.eqsi... 
[13:40:47] back [13:40:59] my secretary has arrived [13:41:05] branches seem straightforward, I should be able to do it [13:42:19] (03CR) 10Ottomata: "COOL" [puppet] - 10https://gerrit.wikimedia.org/r/430318 (https://phabricator.wikimedia.org/T164008) (owner: 10Elukey) [13:43:08] (03CR) 10Ottomata: "Ah great, even better! Didn't know about /etc/icinga/puppet_hosts.cfg" [puppet] - 10https://gerrit.wikimedia.org/r/430079 (owner: 10Ottomata) [13:44:28] RECOVERY - DPKG on mw2200 is OK: All packages OK [13:44:37] RECOVERY - configured eth on mw2200 is OK: OK - interfaces up [13:44:38] RECOVERY - MD RAID on mw2200 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [13:44:48] RECOVERY - Check size of conntrack table on mw2200 is OK: OK: nf_conntrack is 0 % full [13:44:58] RECOVERY - Disk space on mw2200 is OK: DISK OK [13:44:58] RECOVERY - dhclient process on mw2200 is OK: PROCS OK: 0 processes with command name dhclient [13:45:08] RECOVERY - High CPU load on API appserver on mw2200 is OK: OK - load average: 2.10, 1.78, 2.02 [13:45:17] RECOVERY - Check whether ferm is active by checking the default input chain on mw2200 is OK: OK ferm input default policy is set [13:45:38] (03PS2) 10Gehel: wdqs: remove PrivateTmp option from wdqs-blazegraph systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/430049 (https://phabricator.wikimedia.org/T192759) [13:45:43] (03PS3) 10Gehel: wdqs: remove PrivateTmp option from wdqs-blazegraph systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/430049 (https://phabricator.wikimedia.org/T192759) [13:46:07] RECOVERY - Device not healthy -SMART- on labtestvirt2003 is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labtestvirt2003&var-datasource=codfw%2520prometheus%252Fops [13:46:40] 10Operations, 10Pybal, 10Traffic, 10netops: Rename lvs* LLDP port descriptions after upgrading to stretch - https://phabricator.wikimedia.org/T192087#4174610 (10Vgutierrez) [13:47:42] !log Update 
puppet compiler facts [13:47:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:56] (03CR) 10Gehel: [C: 032] wdqs: remove PrivateTmp option from wdqs-blazegraph systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/430049 (https://phabricator.wikimedia.org/T192759) (owner: 10Gehel) [13:48:50] gilles: yes, I think they're easy to do but while I've requested backports myself in the past I've never deployed one so I can't tell what needs to be done later [13:48:57] RECOVERY - puppet last run on mw2200 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:49:08] RECOVERY - Nginx local proxy to apache on mw2200 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 2.747 second response time [13:49:32] !log beginning upgrade of kafka-jumbo brokers from 1.0.0 -> 1.1.0 : T193495 [13:49:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:36] T193495: Upgrade Kafka on jumbo cluster to 1.1.0 (latest) - https://phabricator.wikimedia.org/T193495 [13:52:13] gilles: Jayprakash12345's patch is failing on continuous integration so I don't think we can merge [13:52:43] I don't know how to debug that quibble stuff, it's hashar's field [13:52:53] Hauskatze: have you seen that in jenkins somewhere? still hasn't reported back in gerrit [13:53:20] https://integration.wikimedia.org/zuul/ <-- here [13:53:27] the gate-and-submit-swat one [13:54:07] and it's now on the patch as well [13:54:13] https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-hhvm-docker/690/console [13:54:44] looks like quibble taking a crap on a corrupt sqlite DB [13:54:48] I don't think it's related to the change [13:55:11] 13:43:29 Creating tables [13:55:11] 13:43:30 sqlite3_step() returned SQLITE_CORRUPT. 
[13:55:37] RECOVERY - DPKG on mw1341 is OK: All packages OK [13:55:41] I don't know t.b.h [13:55:57] (03CR) 10Herron: [C: 032] wmfusercontent.org: add SPF record to disable email [dns] - 10https://gerrit.wikimedia.org/r/430008 (https://phabricator.wikimedia.org/T193408) (owner: 10Dzahn) [13:55:57] RECOVERY - Check whether ferm is active by checking the default input chain on mw1341 is OK: OK ferm input default policy is set [13:56:28] RECOVERY - Check whether ferm is active by checking the default input chain on mw1340 is OK: OK ferm input default policy is set [13:56:47] RECOVERY - configured eth on mw1340 is OK: OK - interfaces up [13:56:54] as the patch is now it is not merged so there's no need to revert anything right? [13:57:00] ok, let's bump this to a future SWAT window, will give releng a chance to fix the job [13:57:07] RECOVERY - DPKG on mw1340 is OK: All packages OK [13:57:09] (03PS4) 10Bodhisattwa: Enable ULS webfonts by default at Bengali Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430360 (https://phabricator.wikimedia.org/T193367) [13:57:19] Hauskatze: yeah, nothing merged, nothing to revert [13:57:23] gilles: sound good to me, you can remove the +2 on the patch though [13:57:28] yeah [13:57:54] Jayprakash12345: for when you're back, we couldn't merge your backport because it fails on jenkins [13:57:59] sorry! 
[13:58:17] !log End of mid-day EU SWAT [13:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:42] RECOVERY - Nginx local proxy to apache on mw1340 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 4.406 second response time [13:59:12] RECOVERY - Check systemd state on mw2200 is OK: OK - running: The system is fully operational [13:59:43] RECOVERY - MD RAID on mw1342 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [14:00:45] (03PS1) 10Jcrespo: mariadb: Pool new vslow,dump host on s4 (db1121), move db1064 to x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430377 (https://phabricator.wikimedia.org/T192979) [14:01:33] 10Operations, 10ops-eqiad, 10Cloud-VPS: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4174676 (10chasemp) [14:01:43] RECOVERY - nutcracker process on mw2200 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [14:02:03] PROBLEM - mediawiki-installation DSH group on mw1250 is CRITICAL: Host mw1250 is not in mediawiki-installation dsh group [14:02:03] PROBLEM - Disk space on mw1250 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:02:03] PROBLEM - Check systemd state on mw1254 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:02:03] PROBLEM - Check systemd state on mw1255 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:02:22] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1341 is OK: OK: synced at Wed 2018-05-02 14:02:13 UTC. 
[14:03:39] (03PS3) 10Herron: wmfusercontent.org: add SPF record to disable email [dns] - 10https://gerrit.wikimedia.org/r/430008 (https://phabricator.wikimedia.org/T193408) (owner: 10Dzahn) [14:03:42] PROBLEM - Nginx local proxy to apache on mw1254 is CRITICAL: connect to address 10.64.48.89 and port 443: Connection refused [14:03:43] PROBLEM - Nginx local proxy to apache on mw1255 is CRITICAL: connect to address 10.64.48.90 and port 443: Connection refused [14:03:43] PROBLEM - nutcracker port on mw1250 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:03:43] PROBLEM - HHVM processes on mw1250 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:03:43] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1254 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:03:43] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1255 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:03:51] 10Operations, 10Traffic, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Remove wildcard vhost for *.wikimedia.org - https://phabricator.wikimedia.org/T192206#4174688 (10Jgreen) >>! In T192206#4174594, @BBlack wrote: > FWIW on my end, the following hostnames are definitely non-functional: > ``` > tex... 
[14:04:05] (03PS1) 10Vgutierrez: hieradata: clean-up codfw lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/430381 (https://phabricator.wikimedia.org/T191897) [14:05:43] RECOVERY - Check systemd state on mw1342 is OK: OK - running: The system is fully operational [14:06:03] RECOVERY - Check whether ferm is active by checking the default input chain on mw1342 is OK: OK ferm input default policy is set [14:07:24] (03PS1) 10Filippo Giunchedi: smart: exclude labnet1002 from checks, mpt controller [puppet] - 10https://gerrit.wikimedia.org/r/430382 [14:08:53] (03CR) 10Vgutierrez: [C: 032] "pcc is happy and shows no op: https://puppet-compiler.wmflabs.org/compiler02/11099/" [puppet] - 10https://gerrit.wikimedia.org/r/430381 (https://phabricator.wikimedia.org/T191897) (owner: 10Vgutierrez) [14:09:03] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is CRITICAL: 10.55 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005 [14:09:44] (03PS2) 10Filippo Giunchedi: smart: exclude labnet1002 from checks, mpt controller [puppet] - 10https://gerrit.wikimedia.org/r/430382 [14:09:46] !log restarting blazegraph to deactivate PrivateTmp [14:09:57] (03CR) 10jenkins-bot: Fix stray space in quicksurvey configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430363 (https://phabricator.wikimedia.org/T187299) (owner: 10Gilles) [14:10:22] RECOVERY - Check the NTP synchronisation status of timesyncd on mw2200 is OK: OK: synced at Wed 2018-05-02 14:10:14 UTC. 
[14:10:23] (03CR) 10Jcrespo: [C: 032] mariadb: Add new node db1121 to mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430368 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [14:10:26] (03CR) 10Filippo Giunchedi: [C: 032] smart: exclude labnet1002 from checks, mpt controller [puppet] - 10https://gerrit.wikimedia.org/r/430382 (owner: 10Filippo Giunchedi) [14:10:30] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Upgrade LVS servers to stretch - https://phabricator.wikimedia.org/T177961#4174710 (10Vgutierrez) [14:10:34] 10Operations, 10Pybal, 10Traffic: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4174706 (10Vgutierrez) 05Open>03Resolved a:03Vgutierrez [14:11:03] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is CRITICAL: 10.55 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005 [14:11:36] (03Merged) 10jenkins-bot: mariadb: Add new node db1121 to mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430368 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [14:11:42] RECOVERY - Nginx local proxy to apache on mw1342 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.047 second response time [14:12:22] ah, i will silence ^^^ [14:13:22] RECOVERY - Nginx local proxy to apache on mw1341 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.053 second response time [14:13:38] (03CR) 10Jcrespo: [C: 032] mariadb: Pool new vslow,dump host on s4 (db1121), move db1064 to x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430377 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [14:13:43] (03PS2) 10Jcrespo: mariadb: Pool new vslow,dump host on s4 (db1121), move db1064 to x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430377 (https://phabricator.wikimedia.org/T192979) [14:13:55] (03PS1) 
10Addshore: Update comment next to "WMDE" wmgMonologChannels entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430384 (https://phabricator.wikimedia.org/T191500) [14:16:10] jynus: I'm going to sync 3 patches in a second, just pinging you as you're clearly touching mediawiki-config a bit right now, and one of the patches is a comment fix in there [14:17:13] (03PS2) 10Addshore: Update comment next to "WMDE" wmgMonologChannels entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430384 (https://phabricator.wikimedia.org/T191500) [14:17:21] (03PS1) 10Jcrespo: dbhosts: Add db1102 into s4, move db1054 to x1 [software] - 10https://gerrit.wikimedia.org/r/430385 (https://phabricator.wikimedia.org/T192979) [14:18:02] ok, I will wait [14:18:14] as long as you don't touch the db-*.php files [14:18:20] I won't be :) [14:18:26] (03CR) 10Addshore: [C: 032] Update comment next to "WMDE" wmgMonologChannels entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430384 (https://phabricator.wikimedia.org/T191500) (owner: 10Addshore) [14:18:49] (03CR) 10jenkins-bot: mariadb: Add new node db1121 to mediawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430368 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [14:18:52] RECOVERY - Device not healthy -SMART- on labnet1002 is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labnet1002&var-datasource=eqiad%2520prometheus%252Fops [14:19:41] (03Merged) 10jenkins-bot: Update comment next to "WMDE" wmgMonologChannels entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430384 (https://phabricator.wikimedia.org/T191500) (owner: 10Addshore) [14:20:33] jynus: I see my fetch pulled in "mariadb: Add new node db1121 to mediawiki" too, I'll rebase that and my change but will not sync yours [14:22:48] your fetch didn't pull that, I was in the middle of a deploy when you interrupted me!
[14:22:57] :-) [14:23:15] Sorry, I would have let you continue had I realised you hadn't run the sync yet! [14:23:38] please go on fast [14:23:46] already syncing mine [14:23:48] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:430384|Update comment next to WMDE wmgMonologChannels entry]] T191500 (duration: 01m 17s) [14:23:51] done! [14:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:52] T191500: deploy patch & logging for tracking user registrations - https://phabricator.wikimedia.org/T191500 [14:23:54] jynus: all yours! [14:25:08] (03PS3) 10Jcrespo: mariadb: Pool new vslow,dump host on s4 (db1121), move db1064 to x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430377 (https://phabricator.wikimedia.org/T192979) [14:25:36] (03CR) 10jenkins-bot: Update comment next to "WMDE" wmgMonologChannels entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430384 (https://phabricator.wikimedia.org/T191500) (owner: 10Addshore) [14:31:08] (03PS6) 10Mark Bergsma: Add unit tests for ProxyFetchMonitoringProtocol [debs/pybal] - 10https://gerrit.wikimedia.org/r/430339 [14:32:01] 10Operations, 10ops-eqiad, 10Cloud-VPS: Update and move labnet1001/1002 - https://phabricator.wikimedia.org/T193579#4174911 (10Cmjohnson) @Andrew @chasemp One other thing we should do here is move labnet1002 to the new switch. Can we do this on May 15? 
1500UTC/1000 EST [14:32:38] (03CR) 10jenkins-bot: mariadb: Pool new vslow,dump host on s4 (db1121), move db1064 to x1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430377 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [14:33:25] !log jynus@tin Synchronized wmf-config/db-codfw.php: Add db1121 (duration: 01m 16s) [14:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:30] 10Operations, 10ops-eqiad, 10Cloud-VPS: labnet1003 and labnet1004 moving and enabling 10G NICs - https://phabricator.wikimedia.org/T193196#4174919 (10Cmjohnson) @Andrew @chasemp I am on vacation 5/11 so let's plan for 5/15 I am available anytime [14:34:00] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1342 is OK: OK: synced at Wed 2018-05-02 14:33:52 UTC. [14:35:20] Would anyone be able to merge https://gerrit.wikimedia.org/r/#/c/429252/ for me? -- makes the coal processor run on a perf team machine, instead of on graphite :-D [14:35:30] (03PS13) 10Imarlier: Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 (https://phabricator.wikimedia.org/T186774) [14:35:54] (03PS2) 10Vgutierrez: Rename lvs[2001-2006] interface dependent hostnames [dns] - 10https://gerrit.wikimedia.org/r/428888 (https://phabricator.wikimedia.org/T191897) [14:37:38] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Add and pool db1121 (duration: 01m 17s) [14:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:15] (03CR) 10Jcrespo: [C: 032] dbhosts: Add db1102 into s4, move db1054 to x1 [software] - 10https://gerrit.wikimedia.org/r/430385 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [14:40:04] (03PS1) 10Gabriel Birke: Enable AdvancedSearch BetaFeature on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430388 (https://phabricator.wikimedia.org/T193182) [14:40:12] (03Merged) 10jenkins-bot: dbhosts: Add db1102 into s4, move db1054 to x1 
[software] - 10https://gerrit.wikimedia.org/r/430385 (https://phabricator.wikimedia.org/T192979) (owner: 10Jcrespo) [14:41:04] (03PS1) 10Urbanecm: Add images.rkd.nl to copy upload whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430390 (https://phabricator.wikimedia.org/T193639) [14:43:39] 10Operations, 10ops-eqiad: Broken memory/CPU on mw1275 - https://phabricator.wikimedia.org/T192902#4175019 (10Cmjohnson) I swapped the DIMM with A1, cleared SEL and powered back on. Let's see if the error returns and/or moves. [14:43:40] RECOVERY - Host mw1275 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [14:43:58] 10Operations, 10ops-eqiad, 10Cloud-VPS: Update and move labnet1001/1002 - https://phabricator.wikimedia.org/T193579#4175026 (10Andrew) @cmjohnson, The 15th sounds good for labnet1001. I don't think 1500UTC is the same thing as 1000 EST but I'm going to assume that the EST part is what interests you :) Labn... [14:44:30] (03PS1) 10Jcrespo: mariadb: Repool db1098 after being checked for consistancy issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430391 (https://phabricator.wikimedia.org/T193331) [14:46:08] 10Operations, 10ops-eqiad, 10Cloud-VPS: Update and move labnet1001/1002 - https://phabricator.wikimedia.org/T193579#4175040 (10chasemp) >>! In T193579#4174911, @Cmjohnson wrote: > @Andrew @chasemp One other thing we should do here is move labnet1002 to the new switch. > > Can we do this on May 15? 1500UTC/1... 
[14:47:01] RECOVERY - configured eth on mw1275 is OK: OK - interfaces up [14:47:51] RECOVERY - Check whether ferm is active by checking the default input chain on mw1275 is OK: OK ferm input default policy is set [14:48:24] 10Operations, 10ops-eqiad, 10Cloud-VPS: Update and move labnet1001/1002 - https://phabricator.wikimedia.org/T193579#4175056 (10Andrew) [14:48:42] 10Operations, 10Performance-Team: Migrate webperf from hafnium to webperf1001 - https://phabricator.wikimedia.org/T186774#4175063 (10Imarlier) @Operations - would love to get a merge on https://gerrit.wikimedia.org/r/#/c/429252/ when someone gets a chance. [14:49:09] 10Operations, 10ops-eqiad, 10Cloud-VPS: Update and move labnet1001/1002 - https://phabricator.wikimedia.org/T193579#4172971 (10Andrew) [14:56:03] 10Operations, 10Puppet, 10Release-Engineering-Team, 10puppet-compiler, 10Continuous-Integration-Config: Figure out a way to enable volunteers to use the puppet compiler - https://phabricator.wikimedia.org/T192532#4175081 (10RobH) >>! In T192532#4173521, @Joe wrote: > The compliler has little to do with @... 
[14:56:24] (03PS1) 10Mark Bergsma: Handle HTTP status 302 and 303 as well as 301 [debs/pybal] - 10https://gerrit.wikimedia.org/r/430393 (https://phabricator.wikimedia.org/T102393) [14:58:36] Hi ops-team - Just to let you know we are going to deploy a new refinery version (hadoop jobs) [14:58:55] !log joal@tin Started deploy [analytics/refinery@318d449]: Regular weekly deploy [14:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:52] (03PS5) 10Eevans: cassandra: increase `vm.max_map_count` to 1048575 [puppet] - 10https://gerrit.wikimedia.org/r/429101 (https://phabricator.wikimedia.org/T193083) [15:01:51] 10Operations, 10Cassandra, 10hardware-requests, 10Services (blocked), and 2 others: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#4175102 (10Cmjohnson) [15:03:13] * elukey sees urandom's patch [15:03:20] :P [15:04:43] urandom: ready to deploy? [15:04:57] elukey: sure, if you want [15:05:12] elukey: i've already set this on the restbase cluster a week or more ago [15:05:26] elukey: so yours is the only one that will change :) [15:05:27] (03CR) 10Elukey: [C: 032] cassandra: increase `vm.max_map_count` to 1048575 [puppet] - 10https://gerrit.wikimedia.org/r/429101 (https://phabricator.wikimedia.org/T193083) (owner: 10Eevans) [15:05:41] poor aqs [15:05:50] naw, it'll be fine [15:05:57] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db1098 after being checked for consistancy issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430391 (https://phabricator.wikimedia.org/T193331) (owner: 10Jcrespo) [15:06:17] after a while, working on aqs on fire [15:06:44] merged! 
[15:06:55] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is OK: (C)10 ge (W)5 ge 2.917 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005 [15:06:56] 10Operations, 10Cassandra, 10Services (blocked), 10User-Eevans, 10User-fgiunchedi: Replace 5 Samsung SSD 850 devices w/ 4 1.6T Intel or HP SSDs - https://phabricator.wikimedia.org/T189822#4175110 (10RobH) [15:07:13] (03Merged) 10jenkins-bot: mariadb: Repool db1098 after being checked for consistancy issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430391 (https://phabricator.wikimedia.org/T193331) (owner: 10Jcrespo) [15:07:34] elukey: thanks! [15:07:41] !log joal@tin Finished deploy [analytics/refinery@318d449]: Regular weekly deploy (duration: 08m 46s) [15:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:25] (03PS1) 10Gehel: wdqs: increase TasksMax to 10000 for blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/430394 (https://phabricator.wikimedia.org/T192759) [15:10:25] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:10:50] (03CR) 10jenkins-bot: mariadb: Repool db1098 after being checked for consistancy issues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430391 (https://phabricator.wikimedia.org/T193331) (owner: 10Jcrespo) [15:11:25] RECOVERY - etcd request latencies on neon is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:13:58] 10Operations, 10Discovery-Search: migrate elasticsearch to stretch - https://phabricator.wikimedia.org/T193649#4175130 (10Gehel) [15:15:19] (03PS1) 10Hoo man: Wikidata entity dumps: Move generic parts into functions [puppet] - 10https://gerrit.wikimedia.org/r/430395 (https://phabricator.wikimedia.org/T190513) 
[15:17:17] (03PS1) 10Chad: group1 to wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430397 [15:18:05] (03CR) 10Chad: [C: 04-2] "later" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430397 (owner: 10Chad) [15:18:27] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4175144 (10jcrespo) No consistency issues found on s6 and s7, repooling. [15:18:47] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1098 (duration: 01m 17s) [15:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:47] 10Operations, 10ops-eqiad, 10Cloud-VPS: Update and move labnet1001/1002 - https://phabricator.wikimedia.org/T193579#4175145 (10Andrew) [15:20:24] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/430394 (https://phabricator.wikimedia.org/T192759) (owner: 10Gehel) [15:26:35] 10Operations, 10Puppet, 10Release-Engineering-Team, 10puppet-compiler, 10Continuous-Integration-Config: Figure out a way to enable volunteers to use the puppet compiler - https://phabricator.wikimedia.org/T192532#4142170 (10thcipriani) > To run the puppet compiler, one needs the 'Job/Build' permission in... [15:26:54] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4175165 (10jcrespo) @robh the specific incident for this host has been taken care, should we centralize the recurring issue into a separate task? If yes, I would close this as r... 
[15:27:04] (03PS1) 10Ottomata: kafka jumbo 1.1.0 inter.broker.protocol.version [puppet] - 10https://gerrit.wikimedia.org/r/430398 (https://phabricator.wikimedia.org/T193495) [15:27:27] (03PS2) 10Ottomata: kafka jumbo 1.1.0 inter.broker.protocol.version [puppet] - 10https://gerrit.wikimedia.org/r/430398 (https://phabricator.wikimedia.org/T193495) [15:27:34] (03CR) 10Ottomata: [V: 032 C: 032] kafka jumbo 1.1.0 inter.broker.protocol.version [puppet] - 10https://gerrit.wikimedia.org/r/430398 (https://phabricator.wikimedia.org/T193495) (owner: 10Ottomata) [15:28:06] (03CR) 10Hoo man: "Tested with testwikidata" [puppet] - 10https://gerrit.wikimedia.org/r/430395 (https://phabricator.wikimedia.org/T190513) (owner: 10Hoo man) [15:29:51] 10Operations, 10ops-eqiad, 10cloud-services-team: labstore1003 SMART failure - https://phabricator.wikimedia.org/T193651#4175173 (10fgiunchedi) [15:29:52] (03PS1) 10Elukey: role::prometheus::analytics: rename cassandra metrics/labels [puppet] - 10https://gerrit.wikimedia.org/r/430399 (https://phabricator.wikimedia.org/T193017) [15:29:56] (03PS1) 10Papaul: DNS: Remove mgmt DNS for db2011 [dns] - 10https://gerrit.wikimedia.org/r/430400 (https://phabricator.wikimedia.org/T187886) [15:31:31] (03PS2) 10Elukey: role::prometheus::analytics: rename cassandra metrics/labels [puppet] - 10https://gerrit.wikimedia.org/r/430399 (https://phabricator.wikimedia.org/T193017) [15:37:16] ACKNOWLEDGEMENT - Device not healthy -SMART- on labstore1003 is CRITICAL: cluster=labsnfs device=megaraid,31 instance=labstore1003:9100 job=node site=eqiad Filippo Giunchedi T193651 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labstore1003&var-datasource=eqiad%2520prometheus%252Fops [15:44:31] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4175204 (10jcrespo) p:05High>03Normal [15:44:33] (03CR) 10Anomie: "The yaml file looks sane. 
I haven't looked too closely at the python file." [puppet] - 10https://gerrit.wikimedia.org/r/430024 (https://phabricator.wikimedia.org/T174047) (owner: 10Bstorm) [15:46:23] ottomata: Heya, could you help me figure out which one is the canonical one to graph? https://grafana-admin.wikimedia.org/dashboard/db/eventlogging-schema?from=now-3h&to=now&orgId=1 [15:46:39] E.g. did it change temporarily, and we need to graph multiple, or should we just display one of them only? [15:48:58] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: db2081 crashed/rebooted, probably due to hardware failure - https://phabricator.wikimedia.org/T193325#4175216 (10jcrespo) I am leaving a check ongoing on wikidatawiki on some codfw hosts to prove no data was lost. [15:49:10] (03PS1) 10Vgutierrez: lvs10[13-16] production DNS entries, all vlans [dns] - 10https://gerrit.wikimedia.org/r/430402 (https://phabricator.wikimedia.org/T184293) [15:50:15] RECOVERY - Disk space on mw1250 is OK: DISK OK [15:50:34] ah Krinkle it should be all of them. the topic only has one partition, so at any given time there is only one leader [15:50:40] what you are seeing is leader rebalances [15:50:45] because i'm restarting brokers right now [15:50:52] if it had more partitions, like webrequest_text [15:50:55] RECOVERY - HHVM processes on mw1250 is OK: PROCS OK: 6 processes with command name hhvm [15:50:55] you'd need to sum all partitions [15:51:29] ottomata: Hm.. not sure how to graph that then. [15:51:47] should be the same as this [15:51:47] https://grafana-admin.wikimedia.org/dashboard/db/kafka-by-topic?refresh=5m&orgId=1 [15:52:07] We can probably pick one kafka_cluster though? [15:52:19] kafka_cluster?
yes [15:52:29] for eventlogging it is always jumbo-eqiad [15:52:38] ah so we can remove that from display [15:53:15] Yeah, the other one isn't usually there, but popped up a bit earlier for a few minutes [15:53:40] editing, i'm also hiding the prometheus datasource variable, it will never change for this dash [15:53:50] cool [15:54:10] 10Operations, 10Release-Engineering-Team, 10SRE-Access-Requests, 10User-Urbanecm: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830#4175230 (10RobH) 05Open>03declined discussion with user off task. [15:54:28] When we display 001 and 003, I'd expect them to add up if it's a switch, but it seems they can actually both have the same messages (which I know Kafka handles fine, but less nice for the graph). Or are those rounding errors from timestamps? [15:54:29] hmm i can sum this too i guess [15:54:42] i'm not sure, but i'd guess it is just a metric blip, yeah [15:54:59] Hm. actually, I think the spike makes sense given it was down for a minute, right? [15:55:01] So it's catching up [15:55:12] it's processing time, not schema time, right? [15:56:06] they could be duplicate events, true. this is rate of number of messages in a topic reported by each broker [15:56:12] the overlap is from when the leader changes [15:56:14] due to a restart [15:56:32] it is possible the producer will re-produce duplicates then [15:56:40] Right [15:56:58] or, it could be a metric blip too? the irate is over 5 minutes? [15:57:01] so might be an overlap [15:57:21] Krinkle: would that graph be more useful if I summed out the broker hostname? [15:57:34] ottomata: I think so yeah. [15:57:34] How strictly is the "long-running scripts require a deployment window" rule interpreted? I want to run recountCategories.php on huwiki (due to community request). My rough estimate is I expect the task to take somewhere between 20 min and an hour, but I'm not really sure.
Is it cool if I just run it, or do I need to put this in the deploy calendar [15:57:39] probably not, right? its good to see the leader flip when this happens so we can be sure it is due to flip? [15:57:45] Also because we need it down below to do a division [15:58:06] Yeah, that's also true :) [16:01:21] RECOVERY - nutcracker port on mw1250 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [16:01:21] RECOVERY - Nginx local proxy to apache on mw1255 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.065 second response time [16:01:30] RECOVERY - Nginx local proxy to apache on mw1254 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 1.958 second response time [16:01:51] RECOVERY - Check systemd state on mw1255 is OK: OK - running: The system is fully operational [16:02:20] RECOVERY - Check systemd state on mw1254 is OK: OK - running: The system is fully operational [16:03:50] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1255 is OK: OK: synced at Wed 2018-05-02 16:03:42 UTC. [16:03:50] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1254 is OK: OK: synced at Wed 2018-05-02 16:03:42 UTC. [16:04:32] (03CR) 10Ottomata: "Actually, I did know about that file oh yeah! Great idea." [puppet] - 10https://gerrit.wikimedia.org/r/430079 (owner: 10Ottomata) [16:05:12] ottomata: btw, if you set "null as zero" at https://grafana-admin.wikimedia.org/dashboard/db/kafka-by-topic?refresh=5m&orgId=1&panelId=6&fullscreen&edit&tab=display you'll find it renders much better :) [16:05:28] (03PS3) 10Ottomata: icinga-downtime - fail if given FQDN [puppet] - 10https://gerrit.wikimedia.org/r/430079 [16:07:33] oh great Krinkle thanks [16:09:41] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1965 bytes in 0.095 second response time [16:09:51] (03CR) 10Bstorm: "I was thinking the same thing. 
Parsing fragments like this is not very flexible without becoming incredibly complex. I might try somethi" [puppet] - 10https://gerrit.wikimedia.org/r/430024 (https://phabricator.wikimedia.org/T174047) (owner: 10Bstorm) [16:12:41] how come id.wikimedia.org now redirects to id.wikipedia.org? [16:12:53] it was supposed to be a new wiki [16:13:10] and yesterday it was still just showing the standard page when virtual host exists but no wiki [16:13:54] Urbanecm: ^ do you know? [16:15:44] mutante: It seems it's not a varnish or apache redirect. id.wikimedia.org is somehow reaching MediaWiki PHP and is interpreting it as idwiki=id.wikipedia, and then just redirecting to the canonical location of /wiki/Main_Page, which happens to be on a different domain than the current one. [16:15:59] Seems like something that would be caused by a problem in chapter.dblist and/or multiversion parsing of hostnames. [16:16:15] 10Operations, 10ops-ulsfo, 10Traffic, 10netops: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552#4175273 (10ayounsi) [16:16:30] You can tell by the response headers for https://id.wikimedia.org?sdfdsf containing wgLogo preload links for idwiki.png [16:16:39] as well as the p3p header [16:19:25] marlier: i can help merge that patch after our standup in 10-20 mins [16:19:29] looks fine to me :) [16:19:32] Krinkle: i see! sounds like it will be cleared up once addWiki.php has actually run [16:19:33] 10Operations, 10Traffic, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Remove wildcard vhost for *.wikimedia.org - https://phabricator.wikimedia.org/T192206#4175277 (10BBlack) @Jgreen yeah if you don't have any special purpose for them, then they're basically the same as the text-lb ones (we like h... 
[16:19:39] thanks Krinkle [16:19:50] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1947 bytes in 0.097 second response time [16:20:20] mutante: to confirm, the older wiki is fine, but the new one is not yet working (redirecting wrongly), correct? [16:20:58] Krinkle: yes, the new one just hasnt been created yet. i have merged the Apache change though that adds the ServerAlias for it [16:21:07] k, yeah. [16:21:08] the old one is fine [16:21:20] just that yesterday i didnt see the redirect yet [16:21:25] This is courtesy of the legacy that db suffix "wiki" is both for wikipedias and for special wikimedia.org wikis [16:21:40] so unless it knows about the special case, it assumes a wikipedia for suffix 'wiki' [16:21:50] all other domains have their own suffix, so this doesn't happen. [16:22:01] It's still weird though. It should be able to tell that wikimedia.org cannot be wikipedia [16:22:14] gotcha. makes sense. special wikis should be called "swiki" or something :p [16:22:39] well.. chapter wikis [16:22:42] But it's currently mapping in the other direction. E.g. it's not saying wikimedia.org is wikipedia, it's saying wikimedia.org is suffix "wiki", which is correct. But there is no config for it, so it's getting idwiki config, which in turn says it's wikipedia. [16:22:59] for chapter wikis we use the suffix 'wikimedia' [16:23:05] No conflict there. [16:23:26] The conflict is with special wikimedia.org wikis (chapters aren't special, sorry :P ) [16:23:31] e.g. commonswiki metawiki [16:23:46] It would not be feasible to have a non-language code wiki under wikipedia.org right now. [16:23:56] ok, but once this is created it will be "idwikimedia" then [16:24:06] Yeah, that should work fine. 
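(Editor's sketch, not from the channel.) The suffix behaviour Krinkle describes can be illustrated roughly as below. Everything in it is an assumption made for illustration: the function names, the `chapters` and `special` sets, and the two-step lookup are hypothetical stand-ins, not MediaWiki's or multiversion's actual code. It only shows why a wikimedia.org host with no chapter entry would fall through to the "wiki" suffix and then be treated as a Wikipedia.

```python
# Hypothetical illustration only; none of these names exist in MediaWiki.

def dbname_for_host(host, chapters):
    """Map a hostname to a database name, as loosely described in the chat."""
    sub, _, domain = host.partition(".")
    if domain == "wikimedia.org":
        if sub in chapters:
            # Chapter wikis use the suffix "wikimedia" (e.g. idwikimedia).
            return sub + "wikimedia"
        # Special wikimedia.org wikis share the suffix "wiki" with Wikipedias.
        return sub + "wiki"
    if domain == "wikipedia.org":
        return sub + "wiki"
    raise ValueError("unhandled domain: " + domain)


def canonical_host(dbname, special):
    """Map a database name back to its canonical hostname."""
    if dbname.endswith("wikimedia"):
        return dbname[: -len("wikimedia")] + ".wikimedia.org"
    if dbname.endswith("wiki"):
        name = dbname[: -len("wiki")]
        if name in special:
            # Known special cases such as commonswiki and metawiki.
            return name + ".wikimedia.org"
        # Without a special case, suffix "wiki" is assumed to be a Wikipedia.
        return name + ".wikipedia.org"
    raise ValueError("unhandled dbname: " + dbname)


special = {"commons", "meta"}

# Before the new wiki is known as a chapter: id.wikimedia.org resolves to
# "idwiki", whose canonical location is on wikipedia.org.
before = canonical_host(dbname_for_host("id.wikimedia.org", set()), special)

# Once "id" is listed as a chapter, it resolves to "idwikimedia" instead.
after = canonical_host(dbname_for_host("id.wikimedia.org", {"id"}), special)

print(before, after)
```

Under these assumptions the first lookup redirects to the Wikipedia domain and the second stays on wikimedia.org, which matches the expectation in the chat that the redirect clears up once the wiki actually exists.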
[16:24:11] ok [16:24:42] (03PS2) 10Gehel: wdqs: increase TasksMax to 10000 for blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/430394 (https://phabricator.wikimedia.org/T192759) [16:25:41] (03CR) 10Gehel: [C: 032] wdqs: increase TasksMax to 10000 for blazegraph [puppet] - 10https://gerrit.wikimedia.org/r/430394 (https://phabricator.wikimedia.org/T192759) (owner: 10Gehel) [16:28:20] !log restarting blazegraph to increase TasksMax [16:28:34] (03CR) 10BBlack: [C: 031] "Looks like correct mapping, best as I can tell manually!" [dns] - 10https://gerrit.wikimedia.org/r/430402 (https://phabricator.wikimedia.org/T184293) (owner: 10Vgutierrez) [16:29:06] gehel: FYI that !log didn't stick :( [16:29:13] !log restarting blazegraph to increase TasksMax [16:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:24] godog: thanks! there was a leading space... [16:29:37] indeed [16:34:33] Alrighty I'm going to run recountCategories.php for T169964 [16:34:33] T169964: Counter of the numbers of the pages on a category shows negative result - https://phabricator.wikimedia.org/T169964 [16:35:00] 10Operations, 10Puppet, 10Release-Engineering-Team, 10puppet-compiler, 10Continuous-Integration-Config: Figure out a way to enable volunteers to use the puppet compiler - https://phabricator.wikimedia.org/T192532#4175323 (10RobH) Correct, but it seems we'll need to add another ldap group, since we don't... 
[16:35:55] !log run recountCategories.php on huwiki T169964 [16:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:40] 10Operations, 10Cloud-VPS, 10Patch-For-Review: package prometheus-rabbitmq-exporter for Debian jessie - https://phabricator.wikimedia.org/T188392#4175334 (10aborrero) Package was added to `jessie-wikimedia` [16:37:43] RECOVERY - nova-compute proc maximum on labtestvirt2002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [16:37:43] PROBLEM - Check systemd state on labtestneutron2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:38:03] RECOVERY - nova-compute proc minimum on labtestvirt2002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n] /usr/bin/nova-compute [16:39:44] (03PS1) 10Arturo Borrero Gonzalez: d/changelog: generate entry for jessie-wikimedia [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/430408 [16:40:26] Wow, that script is done. 
I really thought it'd take like half an hour not 30 seconds [16:41:25] (03PS1) 10Marostegui: s3.hosts: Add db1116:3313 [software] - 10https://gerrit.wikimedia.org/r/430410 [16:41:46] (03PS1) 10Arturo Borrero Gonzalez: openstack: rabbitmq_exporter package was added to jessie-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/430411 (https://phabricator.wikimedia.org/T188392) [16:41:52] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] d/changelog: generate entry for jessie-wikimedia [debs/prometheus-rabbitmq-exporter] - 10https://gerrit.wikimedia.org/r/430408 (owner: 10Arturo Borrero Gonzalez) [16:44:04] (03PS1) 10Muehlenhoff: Move scap proxy in B3 to mw2255 [puppet] - 10https://gerrit.wikimedia.org/r/430412 [16:44:49] 10Operations, 10Puppet, 10Release-Engineering-Team, 10puppet-compiler, 10Continuous-Integration-Config: Figure out a way to enable volunteers to use the puppet compiler - https://phabricator.wikimedia.org/T192532#4142170 (10Dzahn) fwiw, we already have a listed of "trusted devs" somewhere in integration/... [16:47:43] 10Operations, 10Puppet, 10Release-Engineering-Team, 10puppet-compiler, 10Continuous-Integration-Config: Figure out a way to enable volunteers to use the puppet compiler - https://phabricator.wikimedia.org/T192532#4175362 (10Paladox) ^^ that will work zuul side, but this requires users to have permissions... [16:51:03] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1098 crashed and got rebooted - https://phabricator.wikimedia.org/T193331#4175365 (10Marostegui) I vote for closing this and if it happens again on any other host. Open a general task and a case with the vendor. db1100 crashed half a year ago and ne... [16:53:30] (03CR) 10Muehlenhoff: [C: 032] Move scap proxy in B3 to mw2255 [puppet] - 10https://gerrit.wikimedia.org/r/430412 (owner: 10Muehlenhoff) [16:56:53] ottomata: That would be great, if you don't mind -- really any time is fine. 
[16:57:16] marlier: ah lemme make some lunch...but can do in a bit if you are still around [16:57:24] I'll be here. [16:57:38] Whenever works! [17:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180502T1700). [17:00:05] subbu, jdlrobson, RoanKattouw, and framawiki: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:53] o/ [17:01:40] o/ i am in a meeting and be available in 30 mins. [17:01:49] so, other's patches can be swatted before mine. [17:02:02] ACKNOWLEDGEMENT - configured eth on labtestvirt2002 is CRITICAL: eth1 reporting no carrier. andrew bogott T193653 [17:02:03] I'm here but can't do the SWAT myself [17:02:07] RECOVERY - mediawiki-installation DSH group on mw1250 is OK: OK [17:04:05] Or, well, I guess I can if nobody else will do it, it'll just be a bit slower [17:06:17] (03PS2) 10Catrope: Enable wgCiteResponsiveReferences on kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430075 (https://phabricator.wikimedia.org/T193491) (owner: 10Framawiki) [17:06:21] (03CR) 10Catrope: [C: 032] Enable wgCiteResponsiveReferences on kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430075 (https://phabricator.wikimedia.org/T193491) (owner: 10Framawiki) [17:06:24] 10Operations, 10Cloud-VPS, 10cloud-services-team: rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655#4175441 (10RobH) p:05Triage>03Normal [17:07:03] 10Operations, 10Cloud-VPS, 10cloud-services-team: rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655#4175457 (10RobH) Please note that @chasemp wants to review all of the above 
(specifically the labstore1003 replacement and racking redundancy plan) with @Andrew before... [17:08:29] (03Merged) 10jenkins-bot: Enable wgCiteResponsiveReferences on kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430075 (https://phabricator.wikimedia.org/T193491) (owner: 10Framawiki) [17:08:46] (03CR) 10jenkins-bot: Enable wgCiteResponsiveReferences on kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430075 (https://phabricator.wikimedia.org/T193491) (owner: 10Framawiki) [17:09:33] framawiki: Your patch is on mwdebug1002, please test [17:10:50] RoanKattouw: looks good [17:12:46] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable $wgCiteResponsiveReferences on kowiki (T193491) (duration: 01m 17s) [17:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:51] T193491: Convert reference lists over to `responsive` on kowiki - https://phabricator.wikimedia.org/T193491 [17:15:49] jdlrobson: Your patch is on mwdebug1002, please test [17:16:55] RoanKattouw: good on live too for mine, thanks! 
[17:23:23] jdlrobson: Ping again [17:24:13] RoanKattouw: here [17:24:18] sorry :) [17:24:30] jdlrobson: Your patch is on mwdebug1002, please test [17:24:43] (03PS2) 10Catrope: Enable RemexHtml on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427182 (https://phabricator.wikimedia.org/T192386) (owner: 10Subramanya Sastry) [17:24:51] (03CR) 10Catrope: [C: 032] Enable RemexHtml on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427182 (https://phabricator.wikimedia.org/T192386) (owner: 10Subramanya Sastry) [17:25:09] RoanKattouw: you can sync that's good to go [17:25:22] verified on mediawiki.org that it fixes the issue [17:26:15] (03Merged) 10jenkins-bot: Enable RemexHtml on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427182 (https://phabricator.wikimedia.org/T192386) (owner: 10Subramanya Sastry) [17:29:12] !log catrope@tin Synchronized php-1.32.0-wmf.2/extensions/Kartographer/: Add missing util dependency (duration: 01m 14s) [17:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:14] (03PS1) 10Framawiki: Create the 'eventcoordinator' user group on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430418 (https://phabricator.wikimedia.org/T193075) [17:31:55] !log catrope@tin Synchronized php-1.32.0-wmf.2/extensions/MobileFrontend/: T193564 (duration: 01m 20s) [17:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:59] T193564: Regression: When searching all pages appear as watched - https://phabricator.wikimedia.org/T193564 [17:33:10] marlier: ok let's merge it [17:34:00] ottomata: Whenever you're ready [17:34:11] (03CR) 10Ottomata: [C: 032] Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 (https://phabricator.wikimedia.org/T186774) (owner: 10Imarlier) [17:34:15] (03PS14) 10Ottomata: Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 
(https://phabricator.wikimedia.org/T186774) (owner: 10Imarlier) [17:34:17] (03CR) 10Ottomata: [V: 032 C: 032] Make webperf role install coal things [puppet] - 10https://gerrit.wikimedia.org/r/429252 (https://phabricator.wikimedia.org/T186774) (owner: 10Imarlier) [17:34:38] marlier: shall I run puppet on webperf1001? [17:34:43] there are some clean up steps after, correct? [17:34:59] or...what hosts should I run this on? [17:35:07] RoanKattouw, only merged, not on mwdebug1002 right? [17:35:16] The cleanup steps can happen any time [17:35:17] Yes I'm running a bit behind, sorry [17:35:29] ottomata: I can run puppet, so no worries. [17:35:47] ok [17:35:53] RoanKattouw, i see you are heavily multi-tasking ;) [17:35:54] ok marlier let me know if you need anything else then? [17:36:00] Nope, should be fine! [17:36:05] Thank you, though. [17:36:06] oh that's it! ok great! :) [17:36:07] yw [17:37:48] PROBLEM - Device not healthy -SMART- on mw1230 is CRITICAL: cluster=api_appserver device=sda instance=mw1230:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mw1230&var-datasource=eqiad%2520prometheus%252Fops [17:39:36] subbu: On mwdebug1002 now, please test [17:39:46] ok. [17:40:50] !log imarlier@tin Started deploy [performance/coal@bd7568a]: deploy coal to webperf1001 [17:40:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:56] !log imarlier@tin Finished deploy [performance/coal@bd7568a]: deploy coal to webperf1001 (duration: 00m 06s) [17:40:57] PROBLEM - puppet last run on graphite2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): User[coal],Group[coal] [17:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:04] RoanKattouw, some pages have expected broken rendering .. so, good to go.
[17:42:37] PROBLEM - Check systemd state on webperf1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:43:24] 10Operations, 10ops-codfw, 10cloud-services-team: labtestvirt2002 eth1 showing no carrier - https://phabricator.wikimedia.org/T193653#4175585 (10RobH) [17:44:06] 10Operations, 10Cloud-VPS, 10Patch-For-Review: package prometheus-rabbitmq-exporter for Debian jessie - https://phabricator.wikimedia.org/T188392#4175590 (10chasemp) fyi on labtestneutron2001 atm ```root@labtestneutron2001:~# systemctl list-units --state=failed UNIT LOAD... [17:44:37] PROBLEM - puppet last run on webperf2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[performance/coal] [17:44:47] RECOVERY - Check systemd state on webperf1001 is OK: OK - running: The system is fully operational [17:45:37] PROBLEM - Check systemd state on webperf2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:46:17] RECOVERY - configured eth on labtestvirt2002 is OK: OK - interfaces up [17:46:54] 10Operations, 10ops-codfw, 10cloud-services-team: labtestvirt2002 eth1 showing no carrier - https://phabricator.wikimedia.org/T193653#4175598 (10RobH) 05Open>03Resolved a:03RobH looks like it wasn't enabled on the switch, but is already setup for use. I enabled it: ``` robh@asw-b-codfw> show interfa... [17:47:02] (03PS1) 10Imarlier: coal: require python-tz [puppet] - 10https://gerrit.wikimedia.org/r/430421 (https://phabricator.wikimedia.org/T186774) [17:48:02] RoanKattouw: Sync? 
[17:48:26] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable RemexHtml on metawiki (T192386) (duration: 01m 17s) [17:48:28] PROBLEM - Hadoop NodeManager on analytics1041 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [17:48:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:34] T192386: Enable RemexHTML on metawiki - https://phabricator.wikimedia.org/T192386 [17:48:57] PROBLEM - Hadoop NodeManager on analytics1042 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [17:49:02] (03PS2) 10Catrope: Enable RemexHtml on wikis with <100 issues in high-priority linter cats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429357 (https://phabricator.wikimedia.org/T192299) (owner: 10Subramanya Sastry) [17:49:06] (03CR) 10Catrope: [C: 032] Enable RemexHtml on wikis with <100 issues in high-priority linter cats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429357 (https://phabricator.wikimedia.org/T192299) (owner: 10Subramanya Sastry) [17:49:21] Hadoop nodemanager was me [17:49:23] 10Operations, 10ops-eqiad, 10cloud-services-team: labstore1003 SMART failure - https://phabricator.wikimedia.org/T193651#4175624 (10chasemp) p:05Triage>03High a:03Cmjohnson This is still currently a SPOF and we are probably weeks out on the replacement systems (soon to be racked). Probably best to rep... 
[17:49:23] my downtime expired [17:49:25] sorry [17:49:41] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: labstore1003 SMART failure - https://phabricator.wikimedia.org/T193651#4175627 (10chasemp) [17:49:57] 10Operations, 10Puppet, 10Release-Engineering-Team, 10puppet-compiler, 10Continuous-Integration-Config: Figure out a way to enable volunteers to use the puppet compiler - https://phabricator.wikimedia.org/T192532#4175629 (10EddieGP) @thcipriani: Could you document this on https://wikitech.wikimedia.org/w... [17:50:31] (03Merged) 10jenkins-bot: Enable RemexHtml on wikis with <100 issues in high-priority linter cats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429357 (https://phabricator.wikimedia.org/T192299) (owner: 10Subramanya Sastry) [17:52:10] 10Operations, 10Performance-Team, 10Patch-For-Review: Migrate webperf from hafnium to webperf1001 - https://phabricator.wikimedia.org/T186774#4175647 (10Imarlier) @Operations - ottomata got the prior CR merged fine. https://gerrit.wikimedia.org/r/#/c/430421/ needs to go as well, but can happen any time. [17:52:31] subbu: Patch for the other wikis is on mwdebug1002 now [17:52:44] k [17:53:49] 10Operations: Merge one-line puppet fix - https://phabricator.wikimedia.org/T193660#4175653 (10Imarlier) [17:54:07] (03PS2) 10Imarlier: coal: require python-tz [puppet] - 10https://gerrit.wikimedia.org/r/430421 (https://phabricator.wikimedia.org/T193660) [17:54:07] RoanKattouw, testing ... good to go. [17:54:10] *tested [17:54:48] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. 
Failed resources (up to 3 shown): User[coal],Group[coal] [17:55:54] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable RemexHtml on wikis with <100 high-prio issues (T192299) (duration: 01m 17s) [17:55:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:58] T192299: Enable RemexHTML on additional wikis with < 100 errors in all high priority categories - https://phabricator.wikimedia.org/T192299 [17:58:29] RoanKattouw, i assume that pushed out the metawiki one as well since it was a dependent patch? [17:58:32] !log imarlier@tin Started deploy [performance/coal@bd7568a]: deploy coal to webperf1001 [17:58:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:03] !log mw2174 - repooled [17:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180502T1800) [18:00:35] Ah [18:00:39] Sorry, yes, oops [18:00:51] Or, no I did that separately at 10:48:26 [18:01:06] ah, ok. thanks. 
[18:01:56] checking meta now I don't see anything broken for now [18:03:11] (03CR) 10jenkins-bot: Enable RemexHtml on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427182 (https://phabricator.wikimedia.org/T192386) (owner: 10Subramanya Sastry) [18:03:14] (03CR) 10jenkins-bot: Enable RemexHtml on wikis with <100 issues in high-priority linter cats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/429357 (https://phabricator.wikimedia.org/T192299) (owner: 10Subramanya Sastry) [18:11:24] Thanks for your help today RoanKattouw with getting that patch out [18:12:49] RECOVERY - Check systemd state on webperf2001 is OK: OK - running: The system is fully operational [18:13:21] 10Operations, 10Puppet: Knock down puppet 4 deprecation warnings - https://phabricator.wikimedia.org/T193664#4175736 (10herron) p:05Triage>03Normal [18:13:28] RECOVERY - puppet last run on webperf2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:15:23] (03CR) 10Framawiki: [C: 031] Enable ULS webfonts by default at Bengali Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/430360 (https://phabricator.wikimedia.org/T193367) (owner: 10Bodhisattwa) [18:30:23] (03PS1) 10Thcipriani: Remove duplication ghostscript package declaration [puppet] - 10https://gerrit.wikimedia.org/r/430425 [18:31:51] PROBLEM - Apache HTTP on mw2175 is CRITICAL: connect to address 10.192.32.63 and port 80: Connection refused [18:31:51] PROBLEM - Apache HTTP on mw2176 is CRITICAL: connect to address 10.192.32.64 and port 80: Connection refused [18:33:31] PROBLEM - Check size of conntrack table on mw2175 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:33:31] PROBLEM - MD RAID on mw2175 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:33:31] PROBLEM - Check size of conntrack table on mw2176 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[18:33:31] PROBLEM - MD RAID on mw2176 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:33:36] 10Operations, 10Performance-Team, 10Patch-For-Review: Move coal from graphite#001 nodes to webperf#001 - https://phabricator.wikimedia.org/T159354#4175771 (10Imarlier) [18:34:52] RECOVERY - Apache HTTP on mw2175 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.076 second response time [18:35:11] PROBLEM - Check systemd state on mw2175 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:35:11] PROBLEM - Check systemd state on mw2176 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:36:11] narff.. yes [18:36:21] i am really trying to prevent this but i get it _every_ single time [18:36:31] it's reinstalls [18:36:42] and you cant schedule downtime for non-existing hosts [18:36:51] PROBLEM - Nginx local proxy to apache on mw2175 is CRITICAL: connect to address 10.192.32.63 and port 443: Connection refused [18:36:51] PROBLEM - Nginx local proxy to apache on mw2176 is CRITICAL: connect to address 10.192.32.64 and port 443: Connection refused [18:36:51] PROBLEM - Check the NTP synchronisation status of timesyncd on mw2175 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:36:51] PROBLEM - Check the NTP synchronisation status of timesyncd on mw2176 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [18:37:35] I wonder why the host doesn’t exist for a reinstall? is there a node deactivate step? 
[18:38:00] yes, wmf-auto-reimage removes it [18:38:05] then fails at puppet cert generation [18:38:17] I bet we could get rid of the node deactive and only do a cert clean [18:38:17] then i have to resume it with --no-verify --no-pxe [18:38:30] (03CR) 10Thcipriani: "for context this is a followup to Iad6ef5a56d3f3a220d6b62fc86af1c7f684ec739 That change was causing integration agents using this class to" [puppet] - 10https://gerrit.wikimedia.org/r/430425 (owner: 10Thcipriani) [18:38:34] *deactivate [18:38:34] then it starts the first puppet run and after X hours of waiting [18:38:45] at a random time it will be re-added to icinga [18:39:07] all i can do is try glancing at it every few minutes .. but it takes hours [18:39:17] and gets me every single time i start reading something else for 5 min [18:39:49] for a reinstall it seems like we could leave it in puppetdb since the host hasn’t actually gone away and will update facts after it’s rebuilt [18:40:13] so downtime would persist [18:41:40] yea, i dont know. i think v.olans had a reason for it, but not sure. he said he wants to find a way to fix the icinga issue [18:44:12] RECOVERY - Apache HTTP on mw2176 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.073 second response time [18:44:21] herron: thanks for the DNS merges btw [18:44:30] yeah you bet! [18:46:28] there should be some kind of configurable automatic hold timer built into icinga for newly-defined things (hosts or services) anyways, maybe we can patch *that* [18:46:30] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10Scoring-platform-team (Current), 10Wikimedia-Incident: Cache ORES virtualenv within versioned source - https://phabricator.wikimedia.org/T181071#4176039 (10awight) 05Open>03Resolved [18:47:24] if Foo has no history older than X (e.g. 1h, configurable), and goes CRIT, treat alerting/paging like downtimed. 
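(Editor's sketch, not from the channel.) The hold-timer idea floated just above could look roughly like this. It is a minimal sketch under stated assumptions: `CheckResult`, `should_notify`, and `HOLD_WINDOW` are hypothetical names, and nothing here is an existing Icinga option or API. It only encodes the rule as worded in the chat: a host or service with no history older than X that goes CRIT is treated like a downtimed one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed default; the chat only says "e.g. 1h, configurable".
HOLD_WINDOW = timedelta(hours=1)


@dataclass
class CheckResult:
    name: str             # host or service the check belongs to
    state: str            # "OK", "WARNING" or "CRITICAL"
    first_seen: datetime  # when the monitoring system first learned of it
    checked_at: datetime  # when this result was produced


def should_notify(result, hold_window=HOLD_WINDOW):
    """Return True if a non-OK result should alert or page."""
    if result.state == "OK":
        return False
    # Newly defined objects have no history older than the hold window,
    # so they are treated as if they were in scheduled downtime.
    age = result.checked_at - result.first_seen
    return age >= hold_window


now = datetime(2018, 5, 2, 18, 33, 31)
reinstalled = CheckResult("mw2175", "CRITICAL", now - timedelta(minutes=10), now)
long_lived = CheckResult("mw1230", "CRITICAL", now - timedelta(days=30), now)

print(should_notify(reinstalled), should_notify(long_lived))
```

In this sketch a freshly re-added host stays quiet for its first hour while an established host alerts as usual. Whether "first seen" should reset on a reimage is a design choice the chat leaves open.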
[18:48:11] RECOVERY - Check size of conntrack table on mw2175 is OK: OK: nf_conntrack is 0 % full [18:48:11] RECOVERY - MD RAID on mw2175 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [18:49:39] (03PS1) 10Ottomata: No-op organize kafka broker hiera in prep for main upgrade [puppet] - 10https://gerrit.wikimedia.org/r/430428 (https://phabricator.wikimedia.org/T167039) [18:50:42] RECOVERY - Nginx local proxy to apache on mw2175 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 2.926 second response time [18:51:02] 10Operations, 10Patch-For-Review, 10Scoring-platform-team (Current): Remove deprecated hosts from ORES scap config - https://phabricator.wikimedia.org/T191321#4176103 (10awight) 05Open>03Resolved [18:52:31] that's a good idea. or maybe it would simply work if "check if host exists" would be ignored when the downtime is scheduled and it would write it into the commandfile anyways .. and then see that later once the host exists [18:52:52] (03CR) 10Ottomata: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11100/" [puppet] - 10https://gerrit.wikimedia.org/r/430428 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [18:52:54] (03CR) 10Ottomata: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11100/" [puppet] - 10https://gerrit.wikimedia.org/r/430428 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [18:55:40] (03PS1) 10Ottomata: No-op Move Kafka version specific configs to site based hiera [puppet] - 10https://gerrit.wikimedia.org/r/430430 (https://phabricator.wikimedia.org/T167039) [18:56:19] i'm asking #icinga about that a bit [18:56:22] RECOVERY - Check systemd state on mw2175 is OK: OK - running: The system is fully operational [18:56:24] (03CR) 10Ottomata: [C: 032] No-op Move Kafka version specific configs to site based hiera [puppet] - 10https://gerrit.wikimedia.org/r/430430 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [18:57:42] RECOVERY - Check size of conntrack table on 
mw2176 is OK: OK: nf_conntrack is 0 % full [18:57:43] RECOVERY - MD RAID on mw2176 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [18:58:18] !log ppchelko@tin Started deploy [restbase/deploy@1093d1d]: Sample log action api 4xx with 1% probability [18:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] no_justification: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180502T1900). [19:03:23] RECOVERY - Check systemd state on mw2176 is OK: OK - running: The system is fully operational [19:03:43] RECOVERY - Nginx local proxy to apache on mw2176 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.232 second response time [19:06:34] (03PS1) 10Ottomata: No-op Remove Stretch conditionals for Kafka brokers; all are on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/430432 (https://phabricator.wikimedia.org/T167039) [19:06:52] RECOVERY - Check the NTP synchronisation status of timesyncd on mw2175 is OK: OK: synced at Wed 2018-05-02 19:06:47 UTC. [19:06:53] RECOVERY - Check the NTP synchronisation status of timesyncd on mw2176 is OK: OK: synced at Wed 2018-05-02 19:06:47 UTC. 
[19:08:06] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2175.codfw.wmnet [19:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:05] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2176.codfw.wmnet [19:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:17] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/11101/" [puppet] - 10https://gerrit.wikimedia.org/r/430432 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [19:10:19] (03CR) 10Ottomata: [C: 032] No-op Remove Stretch conditionals for Kafka brokers; all are on Stretch [puppet] - 10https://gerrit.wikimedia.org/r/430432 (https://phabricator.wikimedia.org/T167039) (owner: 10Ottomata) [19:14:36] !log ppchelko@tin Finished deploy [restbase/deploy@1093d1d]: Sample log action api 4xx with 1% probability (duration: 16m 18s) [19:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:45] (03PS1) 10Ottomata: No-op Move Kafka 0.9.0.1 settings to site specific hiera [puppet] - 10https://gerrit.wikimedia.org/r/430435 (https://phabricator.wikimedia.org/T167039) [19:17:29] herron: maybe the whole thing only happens in combination with the failed puppet cert signing and then resuming it with --no-verify.. and the original issue is hopefully solved now with a longer timeout (got merged today). i am starting 3 fresh installs now and we will see soon [19:18:03] !log mw2177, mw2178, mw2179 - reinstalling with stretch [19:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:31] (03CR) 10Thcipriani: [C: 032] "As substitute choo choo driver, I approve this." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/430397 (owner: Chad) [19:19:53] (Merged) jenkins-bot: group1 to wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/430397 (owner: Chad) [19:20:10] (CR) jenkins-bot: group1 to wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/430397 (owner: Chad) [19:22:59] (CR) Ottomata: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/11102/" [puppet] - https://gerrit.wikimedia.org/r/430435 (https://phabricator.wikimedia.org/T167039) (owner: Ottomata) [19:23:48] !log thcipriani@tin rebuilt and synchronized wikiversions files: group1 to 1.32.0-wmf.2 [19:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:55] !log thcipriani@tin rebuilt and synchronized wikiversions files: revert group1 to 1.32.0-wmf.2 [19:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:13] Operations, ops-codfw, cloud-services-team: labtestvirt2002 eth1 showing no carrier - https://phabricator.wikimedia.org/T193653#4176169 (Andrew) I've confirmed that VMs on labtestvirt2002 work properly. Thank you!
[19:26:50] (PS1) Ottomata: No-op Set inter_broker_protocol_version for main in common hiera [puppet] - https://gerrit.wikimedia.org/r/430440 (https://phabricator.wikimedia.org/T167039) [19:28:38] (CR) Ottomata: [C: 2] No-op Set inter_broker_protocol_version for main in common hiera [puppet] - https://gerrit.wikimedia.org/r/430440 (https://phabricator.wikimedia.org/T167039) (owner: Ottomata) [19:31:13] (PS1) Thcipriani: Revert "group1 to wmf.2" [mediawiki-config] - https://gerrit.wikimedia.org/r/430442 [19:32:32] (Abandoned) Dzahn: icinga: enable paging and set contact_group for grid engine checks [puppet] - https://gerrit.wikimedia.org/r/427833 (https://phabricator.wikimedia.org/T177850) (owner: Dzahn) [19:33:07] (CR) Thcipriani: [C: 2] Revert "group1 to wmf.2" [mediawiki-config] - https://gerrit.wikimedia.org/r/430442 (owner: Thcipriani) [19:33:28] RECOVERY - Device not healthy -SMART- on mw1230 is OK: (No output returned from plugin) https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mw1230&var-datasource=eqiad%2520prometheus%252Fops [19:33:28] (CR) Dzahn: [C: 2] DNS: Remove mgmt DNS for db2011 [dns] - https://gerrit.wikimedia.org/r/430400 (https://phabricator.wikimedia.org/T187886) (owner: Papaul) [19:34:36] (Merged) jenkins-bot: Revert "group1 to wmf.2" [mediawiki-config] - https://gerrit.wikimedia.org/r/430442 (owner: Thcipriani) [19:38:01] !log mw2173 - scap pull (wasn't pooled but should have, bring up to date) [19:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:11] (CR) jenkins-bot: Revert "group1 to wmf.2" [mediawiki-config] - https://gerrit.wikimedia.org/r/430442 (owner: Thcipriani) [19:43:52] mutante: fingers crossed! out of curiosity which patch was for the longer timeout?
[19:45:47] herron: https://gerrit.wikimedia.org/r/#/c/429738/ [19:52:08] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 21 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [19:54:49] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2173.codfw.wmnet [19:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:58] (PS1) Ottomata: No-op Smart vary security_inter_broker_protocol [puppet] - https://gerrit.wikimedia.org/r/430446 (https://phabricator.wikimedia.org/T167039) [19:55:40] (CR) jerkins-bot: [V: -1] No-op Smart vary security_inter_broker_protocol [puppet] - https://gerrit.wikimedia.org/r/430446 (https://phabricator.wikimedia.org/T167039) (owner: Ottomata) [19:56:34] (PS2) Ottomata: No-op Smart vary security_inter_broker_protocol [puppet] - https://gerrit.wikimedia.org/r/430446 (https://phabricator.wikimedia.org/T167039) [19:57:08] (CR) jerkins-bot: [V: -1] No-op Smart vary security_inter_broker_protocol [puppet] - https://gerrit.wikimedia.org/r/430446 (https://phabricator.wikimedia.org/T167039) (owner: Ottomata) [19:57:10] (PS3) Ottomata: No-op Smart vary security_inter_broker_protocol [puppet] - https://gerrit.wikimedia.org/r/430446 (https://phabricator.wikimedia.org/T167039) [19:57:44] (CR) jerkins-bot: [V: -1] No-op Smart vary security_inter_broker_protocol [puppet] - https://gerrit.wikimedia.org/r/430446 (https://phabricator.wikimedia.org/T167039) (owner: Ottomata) [19:58:41] (PS4) Ottomata: No-op Smart vary security_inter_broker_protocol [puppet] - https://gerrit.wikimedia.org/r/430446 (https://phabricator.wikimedia.org/T167039) [19:59:53] (CR) Ottomata: [C: 2] No-op Smart vary security_inter_broker_protocol [puppet] - https://gerrit.wikimedia.org/r/430446 (https://phabricator.wikimedia.org/T167039) (owner: Ottomata) [19:59:55] (CR) Ottomata: [C: 2]
"https://puppet-compiler.wmflabs.org/compiler02/11105/" [puppet] - https://gerrit.wikimedia.org/r/430446 (https://phabricator.wikimedia.org/T167039) (owner: Ottomata) [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: (Dis)respected human, time to deploy Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180502T2000). Please do the needful. [20:01:58] RECOVERY - mediawiki-installation DSH group on mw2173 is OK: OK [20:02:08] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 1 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [20:02:20] no parsoid deploy today. [20:06:40] (PS1) Ottomata: Kafka main-codfw patch 1: inter_broker_protocol_version: 1.1.0 [puppet] - https://gerrit.wikimedia.org/r/430449 (https://phabricator.wikimedia.org/T167039) [20:06:42] (PS1) Ottomata: Kafka main-codfw patch 2 [puppet] - https://gerrit.wikimedia.org/r/430450 (https://phabricator.wikimedia.org/T167039) [20:06:44] (PS1) Ottomata: Kafka main-codfw patch 3 [puppet] - https://gerrit.wikimedia.org/r/430451 (https://phabricator.wikimedia.org/T167039) [20:11:45] Operations, Discovery, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: Add HTTPS support to wdqs-internal service - https://phabricator.wikimedia.org/T193473#4170599 (Smalyshev) [20:11:55] Operations, Discovery, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: Add HTTPS support to wdqs-internal service - https://phabricator.wikimedia.org/T193473#4170599 (Smalyshev) [20:16:26] Operations: Upgrade naos to stretch (and rename to deploy2001) - https://phabricator.wikimedia.org/T190524#4176275 (Dzahn) Open>stalled [20:16:35] Operations: Upgrade naos to stretch (and rename to deploy2001) - https://phabricator.wikimedia.org/T190524#4075855 (Dzahn) p:Triage>Normal [20:17:11] Operations: replace tin
(new hardware) - https://phabricator.wikimedia.org/T185275#4176277 (Dzahn) p:Triage>High [20:32:25] PROBLEM - HHVM rendering on mw2177 is CRITICAL: connect to address 10.192.32.65 and port 80: Connection refused [20:32:25] PROBLEM - HHVM rendering on mw2179 is CRITICAL: connect to address 10.192.32.67 and port 80: Connection refused [20:32:25] PROBLEM - nutcracker process on mw2177 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:32:25] PROBLEM - nutcracker process on mw2179 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [20:33:00] herron: the install worked, no more failed cert signing, but doesn't mean the downtime issue is gone [20:33:19] i caught most of them in time though [20:33:31] cool [20:37:45] (CR) MarcoAurelio: [C: 1] "LGTM. This new rule of having to submit two patches for a single change is just annoying." [mediawiki-config] - https://gerrit.wikimedia.org/r/430124 (https://phabricator.wikimedia.org/T193562) (owner: Urbanecm) [20:38:50] (CR) Urbanecm: "Yes, it is (for config), but what I can do with it. I don't want to enforce the SWATter to run a full scap which is just useless. But I ca" [mediawiki-config] - https://gerrit.wikimedia.org/r/430124 (https://phabricator.wikimedia.org/T193562) (owner: Urbanecm) [20:44:45] (CR) MarcoAurelio: [C: 1] "I don't blame you. We just obey the rules, even when they're just annoying (and reduce the number of patches we can schedule for SWAT)..." [mediawiki-config] - https://gerrit.wikimedia.org/r/430124 (https://phabricator.wikimedia.org/T193562) (owner: Urbanecm) [20:47:52] (CR) Urbanecm: "I'm just randomly thinking about the rule, sorry if I expressed my thoughts in bad way.
I totally agree with you, althrough I cannot remem" [mediawiki-config] - https://gerrit.wikimedia.org/r/430124 (https://phabricator.wikimedia.org/T193562) (owner: Urbanecm) [20:54:27] Operations, ops-codfw, Traffic, netops: Interface errors on asw-d-codfw:xe-2/0/47 - https://phabricator.wikimedia.org/T193677#4176406 (ayounsi) [20:58:38] is anyone around that can do a db query for me on the /not/ replicas? [21:01:28] Hauskatze: I could. [21:01:59] Niharika: hi: on eswiki show tables like 'flagged%'; [21:02:20] On it. [21:02:23] to see if they match https://phabricator.wikimedia.org/T193678 [21:03:37] Hauskatze: I get the same 9. [21:03:50] Niharika: thank you very much for checking :) [21:04:01] No problem. [21:06:59] What happened to the train today? [21:07:55] Niharika: it ran out of fuel, we're waiting for a tanker ;) [21:08:12] jouncebot: now [21:08:12] No deployments scheduled for the next 1 hour(s) and 51 minute(s) [21:08:17] jouncebot: next [21:08:17] In 1 hour(s) and 51 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180502T2300) [21:08:42] * Niharika fuels it with fire [21:08:44] wasn't it yesterday btw? [21:09:12] What, the train? [21:09:24] yes [21:09:30] it was yesterday [21:09:37] I hit a blocker rolling the train forward today :( [21:09:38] https://www.mediawiki.org/wiki/Special:Version is running at wmf.2 already [21:09:39] Run Tue-Wed-Thurs, doesn't it? [21:09:44] Hauskatze: https://tools.wmflabs.org/versions/ [21:09:55] https://phabricator.wikimedia.org/T191048 [21:10:30] it is Tue-Thurs. mediawiki is part of group0 so went out yesterday. [21:10:59] today was supposed to be non-wikipedia wikis + cawiki + hewiki (group1) but as soon as I rolled forward error logs exploded [21:11:22] and no errors in the canary wikis? [21:11:32] so I rolled back and here the train sits, derailed.
[21:12:02] group0 gets low enough traffic that it's sometimes not obvious that rolling forward will cause a huge spike of errors. [21:12:10] * Hauskatze asks mutante about 'conftool' [21:12:30] Hauskatze: ? what about it [21:12:38] mutante: any docs I can see? [21:13:01] https://wikitech.wikimedia.org/wiki/Conftool [21:13:20] ktnx [21:13:31] page is "conftool" actual command is "confctl" [21:18:40] (PS1) Ottomata: Set Rack/row info for Kafka main clusters [puppet] - https://gerrit.wikimedia.org/r/430497 (https://phabricator.wikimedia.org/T167039) [21:19:23] (CR) Ottomata: [C: 2] Set Rack/row info for Kafka main clusters [puppet] - https://gerrit.wikimedia.org/r/430497 (https://phabricator.wikimedia.org/T167039) (owner: Ottomata) [21:30:09] Operations, Discovery-Search (Current work): search.wikimedia.org is source of lots of 500s - https://phabricator.wikimedia.org/T179266#4176550 (EBernhardson) a:EBernhardson [21:30:35] Operations, Puppet, Release-Engineering-Team, puppet-compiler, Continuous-Integration-Config: Figure out a way to enable volunteers to use the puppet compiler - https://phabricator.wikimedia.org/T192532#4176552 (greg) >>! In T192532#4175323, @RobH wrote: > Also this needs the sign off of #rel...
[21:32:30] (PS1) EBernhardson: Forward response codes >= 400 on search.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/430502 (https://phabricator.wikimedia.org/T179266) [21:33:51] !log failing-over RG1 to node0 on pfw3-codfw - T192104 [21:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:46] !log Disabling the link between fasw-codfw and pfw3b-codfw (backup) - T192104 [21:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:46] PROBLEM - Router interfaces on pfw3-codfw is CRITICAL: CRITICAL: host 208.80.153.197, interfaces up: 58, down: 1, dormant: 0, excluded: 3, unused: 0 [21:44:01] expected ^ [21:44:58] (PS2) Ottomata: Kafka main-codfw patch 2 [puppet] - https://gerrit.wikimedia.org/r/430450 (https://phabricator.wikimedia.org/T167039) [21:47:05] RECOVERY - Router interfaces on pfw3-codfw is OK: OK: host 208.80.153.197, interfaces up: 66, down: 0, dormant: 0, excluded: 3, unused: 0 [21:51:58] (PS2) Ottomata: Kafka main-codfw patch 3 [puppet] - https://gerrit.wikimedia.org/r/430451 (https://phabricator.wikimedia.org/T167039) [21:52:00] (PS1) Ottomata: Kafka main-codfw patch 4 [puppet] - https://gerrit.wikimedia.org/r/430503 (https://phabricator.wikimedia.org/T167039) [22:07:42] Operations, Traffic, Wikimedia-Apache-configuration, Patch-For-Review: Remove wildcard vhost for *.wikimedia.org - https://phabricator.wikimedia.org/T192206#4176764 (EddieGP) Right, I already wondered whether we need them or they can be removed. I pushed that idea back because I don't want to mix...
[22:07:46] (CR) Chad: Forward response codes >= 400 on search.wikimedia.org (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/430502 (https://phabricator.wikimedia.org/T179266) (owner: EBernhardson) [22:15:21] (PS2) Huji: Remove lines that are now part of AbuseFilter defaults [mediawiki-config] - https://gerrit.wikimedia.org/r/424974 (https://phabricator.wikimedia.org/T178349) [22:17:14] !log Re-enabled the link between fasw-codfw and pfw3b-codfw (backup) - T192104 [22:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:30] !log Disabling the link between fasw-eqiad and pfw3b-eqiad (backup) - T192104 [22:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:02] !log start reindex of viwiki on eqiad elasticsearch, failed on last run due to unrelated issues [22:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:33] PROBLEM - Router interfaces on pfw3-eqiad is CRITICAL: CRITICAL: host 208.80.154.219, interfaces up: 60, down: 1, dormant: 0, excluded: 3, unused: 0 [22:21:33] RECOVERY - nutcracker process on mw2177 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [22:21:33] RECOVERY - HHVM rendering on mw2177 is OK: HTTP OK: HTTP/1.1 200 OK - 76073 bytes in 4.400 second response time [22:22:42] RECOVERY - HHVM rendering on mw2179 is OK: HTTP OK: HTTP/1.1 200 OK - 76073 bytes in 7.063 second response time [22:22:42] RECOVERY - nutcracker process on mw2179 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [22:23:30] twentyafterfour: https://gerrit.wikimedia.org/r/#/c/430507/ [22:24:42] RECOVERY - Router interfaces on pfw3-eqiad is OK: OK: host 208.80.154.219, interfaces up: 69, down: 0, dormant: 0, excluded: 3, unused: 0 [22:24:45] !log Re-enabled the link between fasw-eqiad and pfw3b-eqiad (backup) - T192104 [22:24:48] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:38] (PS5) Bstorm: wiki replicas: provide backward compatibility for MCR changes [puppet] - https://gerrit.wikimedia.org/r/430024 (https://phabricator.wikimedia.org/T174047) [22:30:19] (CR) jerkins-bot: [V: -1] wiki replicas: provide backward compatibility for MCR changes [puppet] - https://gerrit.wikimedia.org/r/430024 (https://phabricator.wikimedia.org/T174047) (owner: Bstorm) [22:31:47] (PS6) Bstorm: wiki replicas: provide backward compatibility for MCR changes [puppet] - https://gerrit.wikimedia.org/r/430024 (https://phabricator.wikimedia.org/T174047) [22:34:26] (CR) Bstorm: "I applied the idea of adding structure in the YAML because I think it is much cleaner that way and more flexible. I think I'll refactor s" [puppet] - https://gerrit.wikimedia.org/r/430024 (https://phabricator.wikimedia.org/T174047) (owner: Bstorm) [22:56:33] Operations, Pybal, Traffic: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4176886 (ayounsi) [22:57:39] Operations, Pybal, Traffic, netops: Rename lvs* LLDP port descriptions after upgrading to stretch - https://phabricator.wikimedia.org/T192087#4176884 (ayounsi) Open>Resolved [22:59:48] jouncebot: refresh [22:59:49] I refreshed my knowledge about deployments. [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180502T2300). [23:00:04] RoanKattouw: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:19] \o/ my refresh worked [23:00:25] I'll SWAT since I'm the only one in there [23:17:45] !log catrope@tin Synchronized php-1.32.0-wmf.2/extensions/Kartographer/: Add maintenance script to purge pages with map tags (T193525) (duration: 01m 18s) [23:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:50] T193525: Write maintenance script that purges pages using maps - https://phabricator.wikimedia.org/T193525 [23:23:35] Operations, ops-ulsfo, Traffic, netops: Rack/cable/configure ulsfo MX204 - https://phabricator.wikimedia.org/T189552#4176934 (RobH) one of the two routers is now temp racked (not enough rack studs to actually mount, its resting on top of the other servers) with temp power/mgmt leads run. stole t... [23:27:04] (PS2) EBernhardson: Forward response codes >= 400 on search.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/430502 (https://phabricator.wikimedia.org/T179266) [23:27:17] (CR) EBernhardson: Forward response codes >= 400 on search.wikimedia.org (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/430502 (https://phabricator.wikimedia.org/T179266) (owner: EBernhardson) [23:29:44] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2177.codfw.wmnet [23:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:14] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2178.codfw.wmnet [23:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:44] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=mw2179.codfw.wmnet [23:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:46:37] (PS1) Ayounsi: Add mgmt for cr3/4-ulsfo [dns] - https://gerrit.wikimedia.org/r/430516 (https://phabricator.wikimedia.org/T189552) [23:47:12] (CR) Ayounsi: [C: 2] Add mgmt for cr3/4-ulsfo [dns] -
https://gerrit.wikimedia.org/r/430516 (https://phabricator.wikimedia.org/T189552) (owner: Ayounsi)