[00:05:58] !log releases1001 - stopped puppet, manually fixing --prefix=/ci setting for jenkins process, killing it, removing init.d file, starting with systemd, jenkins now up T164030
[00:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:04] T164030: setup releases1001.eqiad.wmnet (was: setup mwreleases1001) - https://phabricator.wikimedia.org/T164030
[00:07:53] !log releases1001 - starting Jenkins setup wizard with generated admin pass
[00:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:25:48] (PS1) Dzahn: releases: fix jenkins_prefix, /ci not /jenkins [puppet] - https://gerrit.wikimedia.org/r/381368 (https://phabricator.wikimedia.org/T164030)
[00:26:42] (PS2) Dzahn: releases: fix jenkins_prefix, /ci not /jenkins [puppet] - https://gerrit.wikimedia.org/r/381368 (https://phabricator.wikimedia.org/T164030)
[00:27:26] (CR) Dzahn: [C: 2] releases: fix jenkins_prefix, /ci not /jenkins [puppet] - https://gerrit.wikimedia.org/r/381368 (https://phabricator.wikimedia.org/T164030) (owner: Dzahn)
[00:29:52] Operations, Release-Engineering-Team, vm-requests, Security-General: New ganeti VM for MW release pipeline work - https://phabricator.wikimedia.org/T163743#3645173 (Dzahn)
[00:29:54] Operations, RelEng-Archive-FY201718-Q1, Patch-For-Review, Security-General: setup releases1001.eqiad.wmnet (was: setup mwreleases1001) - https://phabricator.wikimedia.org/T164030#3645172 (Dzahn) Resolved>Open
[00:31:11] Operations, RelEng-Archive-FY201718-Q1, Patch-For-Review, Security-General: setup releases1001.eqiad.wmnet (was: setup mwreleases1001) - https://phabricator.wikimedia.org/T164030#3218909 (Dzahn) https://releases.wikimedia.org/ci/ is now usable. I went through the setup wizard, it said it was...
[00:31:47] Operations, RelEng-Archive-FY201718-Q1, Patch-For-Review, Release-Engineering-Team (Kanban), Security-General: setup releases1001.eqiad.wmnet (was: setup mwreleases1001) - https://phabricator.wikimedia.org/T164030#3645178 (Dzahn) a:Dzahn>demon
[00:33:13] Operations, Datasets-General-or-Unknown, Patch-For-Review, User-ArielGlenn: logrotate issue (cron spam) on dumps hosts - https://phabricator.wikimedia.org/T176810#3645184 (Dzahn) Thanks, no more mails seen :)
[00:37:13] (Abandoned) Dzahn: gerrit: switch to base::service_unit and systemd [puppet] - https://gerrit.wikimedia.org/r/356516 (owner: Dzahn)
[01:04:25] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6479
[01:05:34] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 4058383 keys, up 5 minutes 25 seconds - replication_delay is 0
[01:05:44] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:34:14] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[01:50:04] PROBLEM - puppet last run on mw1203 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:51:53] (CR) Aude: Add loading of wikibase extensions from build (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/381194 (https://phabricator.wikimedia.org/T176948) (owner: Addshore)
[01:55:54] (CR) Aude: Add loading of wikibase extensions from build (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/381194 (https://phabricator.wikimedia.org/T176948) (owner: Addshore)
[02:02:17] (PS1) Aude: Stop using $wgWikibaseSharedCacheKeyPrefix from Wikidata build [mediawiki-config] - https://gerrit.wikimedia.org/r/381371 (https://phabricator.wikimedia.org/T176948)
[02:02:44] (CR) Aude: Add loading of wikibase extensions from build (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/381194 (https://phabricator.wikimedia.org/T176948) (owner: Addshore)
[02:03:16] (CR) Aude: [C: 1] Remove unused wmgUseWikibasePropertySuggester [mediawiki-config] - https://gerrit.wikimedia.org/r/381193 (owner: Addshore)
[02:18:35] RECOVERY - puppet last run on mw1203 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[04:26:44] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[04:27:04] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[04:32:45] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[04:35:14] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[04:54:08] this (small) spike here is not due to varnish-be failed fetches ^
[04:55:07] it's cp4021's frontend
[05:20:58] !log powerup cp4024 T174891
[05:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:05] T174891: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891
[05:23:04] RECOVERY - Host cp4024 is UP: PING OK - Packet loss = 0%, RTA = 78.62 ms
[05:24:24] PROBLEM - HTTPS Unified ECDSA on cp4024 is CRITICAL: SSL CRITICAL - OCSP staple validity for en.wikipedia.org has -1742295 seconds left
[05:24:34] PROBLEM - HTTPS Unified RSA on cp4024 is CRITICAL: SSL CRITICAL - OCSP staple validity for en.wikipedia.org has -1742304 seconds left
[05:25:15] PROBLEM - Freshness of zerofetch successful run file on cp4024 is CRITICAL: CRITICAL: File /var/netmapper/.update-success is more than 86400 secs old!
[05:25:24] RECOVERY - HTTPS Unified ECDSA on cp4024 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345560 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2017-11-22 07:59:59 +0000 (expires in 54 days)
[05:25:34] RECOVERY - HTTPS Unified RSA on cp4024 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345550 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2017-11-22 07:59:59 +0000 (expires in 54 days)
[05:26:24] RECOVERY - Freshness of zerofetch successful run file on cp4024 is OK: OK
[05:26:35] PROBLEM - salt-minion processes on cp4024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[05:27:28] sorry for the icinga spam!
[05:28:50] Operations, Release Pipeline, Release-Engineering-Team (Kanban), User-Joe: Install Blubber on contint1001 - https://phabricator.wikimedia.org/T175296#3645331 (Joe)
[05:34:14] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0
[05:39:54] PROBLEM - salt-minion processes on cp4024 is CRITICAL: NRPE: Command check_check_salt_minion not defined
[05:53:09] !log cp4024 repooled T174891
[05:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:53:16] T174891: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891
[06:16:50] (CR) Giuseppe Lavagetto: [V: 2 C: 2] Add ruby base image and a fluentd image [docker-images/production-images] - https://gerrit.wikimedia.org/r/379793 (owner: Giuseppe Lavagetto)
[06:17:07] (CR) Giuseppe Lavagetto: [V: 2 C: 2] Add image for running fluentd as a daemonset in kubernetes [docker-images/production-images] - https://gerrit.wikimedia.org/r/380697 (owner: Giuseppe Lavagetto)
[06:18:28] Operations, Goal, Kubernetes, Patch-For-Review, and 2 others: Build a slim container for fluentd - https://phabricator.wikimedia.org/T175527#3645398 (Joe) Everything for this ticket is done. We have our own containers for fluent-bit, fluentd and kubernetes-fluentd-daemonset.
[06:28:32] PROBLEM - Disk space on cp3030 is CRITICAL: DISK CRITICAL - free space: / 106 MB (1% inode=84%)
[06:28:32] PROBLEM - Disk space on cp3043 is CRITICAL: DISK CRITICAL - free space: / 121 MB (1% inode=85%)
[06:29:32] PROBLEM - Disk space on cp3040 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=85%)
[06:29:32] PROBLEM - Disk space on cp3032 is CRITICAL: DISK CRITICAL - free space: / 147 MB (1% inode=85%)
[06:29:57] fixing ^
[06:30:22] PROBLEM - Disk space on cp3033 is CRITICAL: DISK CRITICAL - free space: / 118 MB (1% inode=85%)
[06:30:32] RECOVERY - Disk space on cp3030 is OK: DISK OK
[06:30:32] RECOVERY - Disk space on cp3040 is OK: DISK OK
[06:30:32] RECOVERY - Disk space on cp3032 is OK: DISK OK
[06:30:32] RECOVERY - Disk space on cp3043 is OK: DISK OK
[06:31:22] RECOVERY - Disk space on cp3033 is OK: DISK OK
[07:08:57] Operations, Datasets-General-or-Unknown, Patch-For-Review, User-ArielGlenn: logrotate issue (cron spam) on dumps hosts - https://phabricator.wikimedia.org/T176810#3645418 (elukey) Thanks! Manually removed /etc/logrotate.d/xmldumps-nginx from ms1001 that was still spamming this morning :)
[07:17:32] PROBLEM - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 804083.13 seconds
[07:27:32] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational
[07:32:59] Operations, ops-eqiad, Patch-For-Review, User-Elukey, User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3645438 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['mw1309.eqiad.wmnet'] ``` The log can be...
[07:33:17] new jobrunner coming --^ (mw1309)
[07:50:16] (PS1) Giuseppe Lavagetto: Add prometheus-statsd-exporter image [docker-images/production-images] - https://gerrit.wikimedia.org/r/381401 (https://phabricator.wikimedia.org/T175539)
[08:14:41] (PS2) Giuseppe Lavagetto: Add prometheus-statsd-exporter image [docker-images/production-images] - https://gerrit.wikimedia.org/r/381401 (https://phabricator.wikimedia.org/T175539)
[08:20:11] (PS3) Giuseppe Lavagetto: Add prometheus-statsd-exporter image [docker-images/production-images] - https://gerrit.wikimedia.org/r/381401 (https://phabricator.wikimedia.org/T175539)
[08:20:19] !log upgrade pybal to 1.14.0 on codfw secondaries
[08:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:10] (CR) Elukey: "Pcc looks good https://puppet-compiler.wmflabs.org/compiler02/8099/" [puppet] - https://gerrit.wikimedia.org/r/380487 (owner: Elukey)
[08:23:20] (PS3) Elukey: nutcracker: create the service only after the package install [puppet] - https://gerrit.wikimedia.org/r/380487
[08:26:18] (CR) Giuseppe Lavagetto: [V: 2 C: 2] Add prometheus-statsd-exporter image [docker-images/production-images] - https://gerrit.wikimedia.org/r/381401 (https://phabricator.wikimedia.org/T175539) (owner: Giuseppe Lavagetto)
[08:27:32] Operations, Goal, Kubernetes, Patch-For-Review, and 2 others: Build containers for statsd, prometheus-statsd-exporter - https://phabricator.wikimedia.org/T175539#3645505 (Joe) I built a container for prometheus-statsd-exporter from a debian package I created, if we decide to proxy data to statsd...
[08:43:41] (CR) Elukey: [C: 2] nutcracker: create the service only after the package install [puppet] - https://gerrit.wikimedia.org/r/380487 (owner: Elukey)
[08:47:33] (PS1) Giuseppe Lavagetto: Add statsd-proxy image [docker-images/production-images] - https://gerrit.wikimedia.org/r/381411
[08:49:11] (PS2) Giuseppe Lavagetto: Add statsd-proxy image [docker-images/production-images] - https://gerrit.wikimedia.org/r/381411
[08:50:53] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/furl]
[08:53:02] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[08:53:05] just ran puppet again, all good
[08:54:50] (PS3) Giuseppe Lavagetto: Add statsd-proxy image [docker-images/production-images] - https://gerrit.wikimedia.org/r/381411
[08:57:00] (PS1) Elukey: profile::kafka::broker: allow prometheus masters for port 7800 [puppet] - https://gerrit.wikimedia.org/r/381412 (https://phabricator.wikimedia.org/T175922)
[08:57:27] (CR) jerkins-bot: [V: -1] profile::kafka::broker: allow prometheus masters for port 7800 [puppet] - https://gerrit.wikimedia.org/r/381412 (https://phabricator.wikimedia.org/T175922) (owner: Elukey)
[08:59:55] you are right jenkins
[08:59:58] (PS2) Elukey: profile::kafka::broker: allow prometheus masters for port 7800 [puppet] - https://gerrit.wikimedia.org/r/381412 (https://phabricator.wikimedia.org/T175922)
[09:10:24] (CR) Elukey: [C: 2] "pcc looks good: https://puppet-compiler.wmflabs.org/compiler02/8101/kafka-jumbo1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/381412 (https://phabricator.wikimedia.org/T175922) (owner: Elukey)
[09:10:49] (PS1) Ema: prometheus: add pybal configuration for prometheus::ops [puppet] - https://gerrit.wikimedia.org/r/381414 (https://phabricator.wikimedia.org/T171710)
[09:14:12] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active, AS6939/IPv4: Connect
[09:16:13] RECOVERY - BGP status on cr2-ulsfo is OK: BGP OK - up: 78, down: 0, shutdown: 2
[09:17:07] ((External AS 6939) failed: Connection reset by peer)
[09:17:08] Operations, DBA, Wikimedia-Site-requests: Global rename of CodeCat → Rua: supervision needed - https://phabricator.wikimedia.org/T176985#3645614 (alanajjar) Thanks @Legoktm , @MarcoAurelio and @Marostegui
[09:18:22] (PS2) Ema: prometheus: add pybal configuration for prometheus::ops [puppet] - https://gerrit.wikimedia.org/r/381414 (https://phabricator.wikimedia.org/T171710)
[09:21:22] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect, AS6939/IPv6: Active
[09:23:24] RECOVERY - BGP status on cr2-ulsfo is OK: BGP OK - up: 78, down: 0, shutdown: 2
[09:30:32] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect, AS6939/IPv4: Connect
[09:30:52] (PS1) Ema: prometheus: update IPVS aggregation rules [puppet] - https://gerrit.wikimedia.org/r/381417
[09:33:22] !log upgrade pybal to 1.14.0 on codfw primaries
[09:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:42] RECOVERY - BGP status on cr2-ulsfo is OK: BGP OK - up: 78, down: 0, shutdown: 2
[09:39:14] Operations, Release Pipeline, Release-Engineering-Team (Kanban), User-Joe: Install Blubber on contint1001 - https://phabricator.wikimedia.org/T175296#3645691 (akosiaris) >>! In T175296#3645330, @Joe wrote: > It should be built on boron (or copper) and added to the main component IMO. Either `mai...
[09:40:32] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0]
[09:45:15] this is the new jobrunner --^
[09:45:25] I merged my patch after running the reimage :(
[09:48:02] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active, AS6939/IPv4: Connect
[09:49:02] RECOVERY - BGP status on cr2-ulsfo is OK: BGP OK - up: 78, down: 0, shutdown: 2
[09:53:03] Operations, ops-eqiad, Patch-For-Review, User-Elukey, User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3645726 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1309.eqiad.wmnet'] ``` and were **ALL** successful.
[09:53:52] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0]
[09:54:12] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect, AS6939/IPv6: Active
[09:55:07] !log elukey@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1309.eqiad.wmnet
[09:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:22] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 21 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[09:56:52] (PS3) Ema: prometheus: add pybal configuration for prometheus::ops [puppet] - https://gerrit.wikimedia.org/r/381414 (https://phabricator.wikimedia.org/T171710)
[10:01:22] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 13 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[10:03:12] (CR) Filippo Giunchedi: "LGTM, modulo class_name" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/381414 (https://phabricator.wikimedia.org/T171710) (owner: Ema)
[10:03:32] RECOVERY - BGP status on cr2-ulsfo is OK: BGP OK - up: 78, down: 0, shutdown: 2
[10:04:03] (CR) Ema: prometheus: add pybal configuration for prometheus::ops (1 comment) [puppet] - https://gerrit.wikimedia.org/r/381414 (https://phabricator.wikimedia.org/T171710) (owner: Ema)
[10:04:52] (PS4) Ema: prometheus: add pybal configuration for prometheus::ops [puppet] - https://gerrit.wikimedia.org/r/381414 (https://phabricator.wikimedia.org/T171710)
[10:07:46] (CR) Ema: [C: 2] prometheus: add pybal configuration for prometheus::ops [puppet] - https://gerrit.wikimedia.org/r/381414 (https://phabricator.wikimedia.org/T171710) (owner: Ema)
[10:18:33] (PS1) Ladsgroup: mediawiki: stop rebuilding wb_terms table [puppet] - https://gerrit.wikimedia.org/r/381421 (https://phabricator.wikimedia.org/T171460)
[10:21:22] (CR) Jcrespo: [C: 2] mediawiki: stop rebuilding wb_terms table [puppet] - https://gerrit.wikimedia.org/r/381421 (https://phabricator.wikimedia.org/T171460) (owner: Ladsgroup)
[10:23:06] (PS4) Giuseppe Lavagetto: Add statsd-proxy image [docker-images/production-images] - https://gerrit.wikimedia.org/r/381411
[10:25:07] (CR) Giuseppe Lavagetto: [V: 2 C: 2] Add statsd-proxy image [docker-images/production-images] - https://gerrit.wikimedia.org/r/381411 (owner: Giuseppe Lavagetto)
[10:26:03] (PS1) Alexandros Kosiaris: check_interval, retry_interval for screen/tmux check [puppet] - https://gerrit.wikimedia.org/r/381422 (https://phabricator.wikimedia.org/T165348)
[10:32:03] ACKNOWLEDGEMENT - MariaDB Slave Lag: s5 on dbstore1001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 807347.83 seconds Jcrespo broken
[10:33:11] Operations, Goal, Kubernetes, Services (watching), User-Joe: Standardize on the "default" pod setup - https://phabricator.wikimedia.org/T170120#3645796 (Joe)
[10:33:13] Operations, Goal, Kubernetes, Patch-For-Review, and 2 others: Build a slim container for fluentd - https://phabricator.wikimedia.org/T175527#3645795 (Joe) Open>Resolved
[10:33:40] Operations, Goal, Kubernetes, Patch-For-Review, and 2 others: Build a slim container for fluentd - https://phabricator.wikimedia.org/T175527#3595715 (Joe)
[10:33:46] Operations, Goal, Kubernetes, Patch-For-Review, and 2 others: Build containers for statsd, prometheus-statsd-exporter - https://phabricator.wikimedia.org/T175539#3645798 (Joe) Open>Resolved
[10:42:16] Operations, Goal, Kubernetes, Services (watching), User-Joe: Standardize on the "default" pod setup - https://phabricator.wikimedia.org/T170120#3645803 (Joe) We've built the containers for logging and metrics reporting, and designed how to include them into the standard pod setup. While I co...
[10:42:28] Operations, Goal, Kubernetes: Operations Q1 goal: Streamlined Service Delivery - https://phabricator.wikimedia.org/T170108#3645805 (Joe)
[10:42:31] Operations, Goal, Kubernetes, Services (watching), User-Joe: Standardize on the "default" pod setup - https://phabricator.wikimedia.org/T170120#3645804 (Joe) Open>Resolved
[10:44:09] (CR) Aude: Add loading of wikibase extensions from build (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/381194 (https://phabricator.wikimedia.org/T176948) (owner: Addshore)
[10:44:46] Operations, Continuous-Integration-Infrastructure (shipyard): Update docker image docker-registry.wikimedia.org/wikimedia-jessie - https://phabricator.wikimedia.org/T177055#3645812 (hashar)
[10:47:46] https://grafana.wikimedia.org/dashboard/db/job-queue-health?refresh=1m&orgId=1
[10:47:50] This is super bad
[10:48:01] 12M in two or three days?
[10:49:55] Operations, CirrusSearch, Discovery, Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3645836 (Ladsgroup) The jobqueue size just bumped to 12M in two days and it's not going down. I don't know if it's related to wikidata or not but that'...
[10:51:51] (CR) Filippo Giunchedi: [C: 1] prometheus: update IPVS aggregation rules [puppet] - https://gerrit.wikimedia.org/r/381417 (owner: Ema)
[10:52:05] (CR) Addshore: Add loading of wikibase extensions from build (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/381194 (https://phabricator.wikimedia.org/T176948) (owner: Addshore)
[10:52:54] Amir1: two new jobrunners have been added to the pool in the past two days, so I suspect that something weird is going on
[10:53:01] Cc: _joe_
[10:53:35] let's see if the timing matches
[10:55:27] mw1308 finished the 28th at ~12 UTC (if phab time is UTC, think so but never checked) https://phabricator.wikimedia.org/T165519#3642140
[10:57:42] meanwhile mw1309 ~1 hour ago
[10:57:50] I am checking all the wikis on terbium
[11:00:06] seems refreshlinks?
[11:01:54] think renameUser
[11:02:09] https://grafana.wikimedia.org/dashboard/db/job-queue-health?refresh=1m&orgId=1&var-jobType=renameUser
[11:02:23] that might have a spike but not sure if there are other reasons
[11:03:32] its an aude! :D
[11:03:41] maybe also thumbnail render
[11:03:44] hi addshore :-)
[11:04:18] aude: thanks for your review on the patch, I'm probably going to try and switch over to using config in mediawiki-config at the start of next week FYI
[11:04:47] I merged the change adding all the wikibase extensions to wmf-make-branch so next week they should be in the branches
[11:05:10] and once the config is in mediawiki-config for loading the extensions switching them over one by one should be pretty easy
[11:05:12] !log precautionary stop of jobrunner/jobchron on the new mw130[8,9] - job queue size rapid increase investigation
[11:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:54] addshore: sounds good
[11:07:06] aude: I believe we are going to go for the approach of a manual list of classes (as done in other extensions) until we have a psr4 solution
[11:07:14] (PS1) Marostegui: Revert "db-codfw.php: Depool db2053 and db2046" [mediawiki-config] - https://gerrit.wikimedia.org/r/381423
[11:07:18] (PS2) Marostegui: Revert "db-codfw.php: Depool db2053 and db2046" [mediawiki-config] - https://gerrit.wikimedia.org/r/381423
[11:07:46] :/
[11:10:09] in commons wiki I can see refreshLinks: 2059769 queued; 47 claimed (16 active, 31 abandoned); 0 delayed
[11:11:37] also https://grafana.wikimedia.org/dashboard/db/job-queue-rate?panelId=7&fullscreen&orgId=1&refresh=1m&from=now-7d&to=now (isolating refreshlinks) seems matching
[11:12:23] (CR) Marostegui: [C: 2] Revert "db-codfw.php: Depool db2053 and db2046" [mediawiki-config] - https://gerrit.wikimedia.org/r/381423 (owner: Marostegui)
[11:12:50] someone recently renamed a user with 200K edits
[11:13:42] (PS6) Jcrespo: mariadb: Implement regular logical backups using mydumper [puppet] - https://gerrit.wikimedia.org/r/374560 (https://phabricator.wikimedia.org/T169516)
[11:14:27] oh joy
[11:14:43] (Merged) jenkins-bot: Revert "db-codfw.php: Depool db2053 and db2046" [mediawiki-config] - https://gerrit.wikimedia.org/r/381423 (owner: Marostegui)
[11:15:28] Amir1,aude - I have stopped the jobrunners on the new hosts just in case (timing was suspiciously matching)
[11:15:32] Operations, Wikidata, wikiba.se, Wikidata-Sprint, Wikidata-Sprint-2016-11-08: Create wikibase/wikiba.se-deploy repo - https://phabricator.wikimedia.org/T176841#3645898 (Ladsgroup) @hashar told me to make a phab card but either is fine I guess
[11:15:53] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2053 db2046 - T174509 (duration: 00m 48s)
[11:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:15:57] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[11:16:05] if we are sure that the problem is that renamed user I'll of course re-enable them, since we'll probably need a lot of horsepower :D
[11:16:29] (CR) jenkins-bot: Revert "db-codfw.php: Depool db2053 and db2046" [mediawiki-config] - https://gerrit.wikimedia.org/r/381423 (owner: Marostegui)
[11:18:30] (PS2) Ema: prometheus: update IPVS aggregation rules [puppet] - https://gerrit.wikimedia.org/r/381417
[11:18:35] (CR) Ema: [V: 2 C: 2] prometheus: update IPVS aggregation rules [puppet] - https://gerrit.wikimedia.org/r/381417 (owner: Ema)
[11:18:42] (CR) Marostegui: [C: 1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/374560 (https://phabricator.wikimedia.org/T169516) (owner: Jcrespo)
[11:36:44] Amir1, aude - I didn't see it but cirrusSearchLinksUpdate is 5M for commonswiki
[11:36:53] (Joe is checking atm)
[11:42:11] elukey: i'm really not sure the problem but rename user was one of the things that appeared to spike
[11:42:19] or might be combination of things
[11:42:42] sure sure :)
[11:42:59] it seems that a ton of jobs got enqueued recently, most of them in commons
[11:43:04] (PS1) BBlack: cp4021: switch numa mode from "isolate" to "on" [puppet] - https://gerrit.wikimedia.org/r/381429
[11:44:02] I am re-enabling the jobrunners since those poor guys were not the issue
[11:44:14] !log re-enable job runner daemons on mw130[8,9]
[11:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:50:37] (CR) BBlack: [C: 2] cp4021: switch numa mode from "isolate" to "on" [puppet] - https://gerrit.wikimedia.org/r/381429 (owner: BBlack)
[11:53:49] !log cp4021: reboot for numa mode switch
[11:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:58] (PS1) Ladsgroup: mediawiki: Maintenance script to clean up duplicates in wb_terms [puppet] - https://gerrit.wikimedia.org/r/381433 (https://phabricator.wikimedia.org/T163551)
[12:43:18] !log removed numa=isolate sys entries manually (runtime + future boots) from cp4021
[12:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:08] PROBLEM - Disk space on mw1265 is CRITICAL: DISK CRITICAL - free space: / 16120 MB (3% inode=97%)
[13:00:54] whaaat
[13:09:29] !log depool mw1265 (hhvm 3.18.5) - disk filled up
[13:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:51] (PS1) Marostegui: Revert "db-codfw: Depool db2070, db2069, db2042" [mediawiki-config] - https://gerrit.wikimedia.org/r/381438
[13:14:55] (PS2) Marostegui: Revert "db-codfw: Depool db2070, db2069, db2042" [mediawiki-config] - https://gerrit.wikimedia.org/r/381438
[13:17:03] (CR) Marostegui: [C: 2] Revert "db-codfw: Depool db2070, db2069, db2042" [mediawiki-config] - https://gerrit.wikimedia.org/r/381438 (owner: Marostegui)
[13:18:12] !log starting a round of cleanup in ores_classification table in enwiki (T159753)
[13:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:18] T159753: Concerns about ores_classification table size on enwiki - https://phabricator.wikimedia.org/T159753
[13:18:21] (PS6) Rush: openstack: pdns auth module/role/profile [puppet] - https://gerrit.wikimedia.org/r/381295 (https://phabricator.wikimedia.org/T171494)
[13:20:25] (Merged) jenkins-bot: Revert "db-codfw: Depool db2070, db2069, db2042" [mediawiki-config] - https://gerrit.wikimedia.org/r/381438 (owner: Marostegui)
[13:20:35] (CR) jenkins-bot: Revert "db-codfw: Depool db2070, db2069, db2042" [mediawiki-config] - https://gerrit.wikimedia.org/r/381438 (owner: Marostegui)
[13:21:34] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2070, db2069 and db2042 - T174509 (duration: 00m 47s)
[13:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:40] T174509: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509
[13:24:25] (CR) Marostegui: "Not sure I want this enabled on a Friday :-)" [puppet] - https://gerrit.wikimedia.org/r/381433 (https://phabricator.wikimedia.org/T163551) (owner: Ladsgroup)
[13:28:04] (CR) Ladsgroup: "Yeah, definitely. I just made the patch ready :)" [puppet] - https://gerrit.wikimedia.org/r/381433 (https://phabricator.wikimedia.org/T163551) (owner: Ladsgroup)
[13:28:29] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 18817.07 seconds
[13:28:38] PROBLEM - MariaDB Slave Lag: s4 on db2019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 34797.41 seconds
[13:28:38] PROBLEM - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 10957.30 seconds
[13:28:48] PROBLEM - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 7638.72 seconds
[13:31:13] I am going to assume those are some expired downtime on maintenance
[13:31:23] I will check
[13:31:28] RECOVERY - Disk space on mw1265 is OK: DISK OK
[13:32:03] it looks like it
[13:32:09] will ack the
[13:32:10] m
[13:33:02] jynus: they are
[13:33:18] thanks
[13:33:59] going to downtime them again
[13:34:09] ACKNOWLEDGEMENT - MariaDB Slave Lag: s4 on db2019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 35103.26 seconds Jcrespo Ongoing maintenance
[13:34:09] ACKNOWLEDGEMENT - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 17766.17 seconds Jcrespo Ongoing maintenance
[13:34:09] ACKNOWLEDGEMENT - MariaDB Slave Lag: s4 on db2058 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 6039.65 seconds Jcrespo Ongoing maintenance
[13:34:09] ACKNOWLEDGEMENT - MariaDB Slave Lag: s4 on db2065 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 9455.58 seconds Jcrespo Ongoing maintenance
[13:34:21] don't worry, ack will take care of it^
[13:34:43] they should be finished within the next few hours
[13:52:48] (PS1) Rush: openstack: pdns fixup SOA default answer [puppet] - https://gerrit.wikimedia.org/r/381445 (https://phabricator.wikimedia.org/T171494)
[13:53:44] Operations, CirrusSearch, Discovery, Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3646384 (Joe) ``` oblivian@terbium:~$ /usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/group1.dblist showJobs.php --group | awk '{if ($3 > 100...
[13:58:18] RECOVERY - MariaDB Slave Lag: s4 on db2058 is OK: OK slave_sql_lag Replication lag: 0.17 seconds
[14:10:18] RECOVERY - MariaDB Slave Lag: s4 on db2065 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[14:12:19] (Draft3) Jayprakash12345: Temporary IP Cap Lift on zh.wiki and commons [mediawiki-config] - https://gerrit.wikimedia.org/r/381442 (https://phabricator.wikimedia.org/T177071)
[14:14:10] (PS13) Zoranzoki21: Enable RemexHTML on wikitech and eswikiversity [mediawiki-config] - https://gerrit.wikimedia.org/r/379966 (https://phabricator.wikimedia.org/T175971)
[14:15:39] (CR) Zoranzoki21: [C: 1] "Super" [mediawiki-config] - https://gerrit.wikimedia.org/r/381442 (https://phabricator.wikimedia.org/T177071) (owner: Jayprakash12345)
[14:21:52] Hello, This is urgent Task (https://phabricator.wikimedia.org/T177071). I saw that Next Deployments are on October 2nd.
[14:22:23] This event happen on October 1, So See this.
[14:23:54] (CR) Samtar: [C: 1] "lgtm" [mediawiki-config] - https://gerrit.wikimedia.org/r/381442 (https://phabricator.wikimedia.org/T177071) (owner: Jayprakash12345)
[14:40:11] (PS1) Andrew Bogott: move rabbitmq drain_queue script to the rabbitmq module [puppet] - https://gerrit.wikimedia.org/r/381449
[14:40:13] (PS1) Andrew Bogott: rabbitmq: drain and log notifications.error hourly [puppet] - https://gerrit.wikimedia.org/r/381450 (https://phabricator.wikimedia.org/T175029)
[14:40:45] we don't deploy out of deployment windows, requestor should have been more diligent
[14:43:05] Operations, monitoring, Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3263810 (herron) Today there are ~20 unhandled screen/tmux problems in icinga. Maybe this number will decrease after handling the initial problems, but I could also see this...
[14:51:37] (CR) Jcrespo: "https://puppet-compiler.wmflabs.org/compiler02/8106/dbstore2001.codfw.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/374560 (https://phabricator.wikimedia.org/T169516) (owner: Jcrespo)
[14:55:26] chasemp: hiiii would love a review over here when you find some time: https://gerrit.wikimedia.org/r/#/c/379004/
[14:55:34] (CR) BryanDavis: rabbitmq: drain and log notifications.error hourly (1 comment) [puppet] - https://gerrit.wikimedia.org/r/381450 (https://phabricator.wikimedia.org/T175029) (owner: Andrew Bogott)
[14:57:07] (CR) BryanDavis: [C: 1] move rabbitmq drain_queue script to the rabbitmq module [puppet] - https://gerrit.wikimedia.org/r/381449 (owner: Andrew Bogott)
[14:58:21] (CR) BryanDavis: [C: 1] rabbitmq: drain and log notifications.error hourly (1 comment) [puppet] - https://gerrit.wikimedia.org/r/381450 (https://phabricator.wikimedia.org/T175029) (owner: Andrew Bogott)
[14:58:28] (PS2) Dzahn: check_interval, retry_interval for screen/tmux check [puppet] - https://gerrit.wikimedia.org/r/381422 (https://phabricator.wikimedia.org/T165348) (owner: Alexandros Kosiaris)
[14:59:22] (CR) Andrew Bogott: [C: 2] move rabbitmq drain_queue script to the rabbitmq module [puppet] - https://gerrit.wikimedia.org/r/381449 (owner: Andrew Bogott)
[14:59:31] (CR) Andrew Bogott: [C: 2] rabbitmq: drain and log notifications.error hourly [puppet] - https://gerrit.wikimedia.org/r/381450 (https://phabricator.wikimedia.org/T175029) (owner: Andrew Bogott)
[14:59:51] Operations, monitoring, Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3646585 (Dzahn) >>! In T165348#3646509, @herron wrote: > Today there are ~20 unhandled screen/tmux problems in icinga. This means adding the monitoring was a success, right...
[15:00:34] (03CR) 10Dzahn: [C: 032] "thanks, good point" [puppet] - 10https://gerrit.wikimedia.org/r/381422 (https://phabricator.wikimedia.org/T165348) (owner: 10Alexandros Kosiaris) [15:00:42] (03PS3) 10Dzahn: check_interval, retry_interval for screen/tmux check [puppet] - 10https://gerrit.wikimedia.org/r/381422 (https://phabricator.wikimedia.org/T165348) (owner: 10Alexandros Kosiaris) [15:02:32] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3263810 (10elukey) If nobody disagrees I'd whitelist stat100[456] boxes since several people are going to keep using screen/tmux for long computations (like researchers, analyst... [15:04:03] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3646605 (10Dzahn) I suggested whitelisting these (and other hosts regularly used for screens which i found after my initial check). The reaction was that i should please justify... [15:06:28] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/drain_queue] [15:06:38] PROBLEM - exim queue on mx1001 is CRITICAL: CRITICAL: 3051 mails in exim queue. [15:07:12] (03PS1) 10Andrew Bogott: rabbitmq: fix a broken filepath from a previous patch [puppet] - 10https://gerrit.wikimedia.org/r/381454 [15:07:18] RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 0.36 seconds [15:07:38] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/sbin/drain_queue] [15:07:40] (03CR) 10Andrew Bogott: [C: 032] rabbitmq: fix a broken filepath from a previous patch [puppet] - 10https://gerrit.wikimedia.org/r/381454 (owner: 10Andrew Bogott) [15:09:23] ottomata: I'll try to take a look today [15:09:38] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [15:09:46] 10Operations, 10Cloud-Services, 10Cloud-VPS, 10Patch-For-Review, 10Wikimedia-Incident: Some labs instances IP have multiple PTR entries in DNS - https://phabricator.wikimedia.org/T115194#3646634 (10Andrew) 05Open>03Resolved This is as fixed as it's going to be. Any time there's a designate outage I... [15:09:58] thanks! :) [15:10:59] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3646639 (10herron) >>! In T165348#3646585, @Dzahn wrote: >>>! In T165348#3646509, @herron wrote: >> Today there are ~20 unhandled screen/tmux problems in icinga. > > This mea... [15:12:43] chasemp: ! in gerrit, you are not Rush? [15:12:56] (03PS1) 10Dzahn: screen-monitor: un-whitelist puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/381456 (https://phabricator.wikimedia.org/T165348) [15:13:55] ottomata: I am :) I have chasemp in ldap from before employment but rush is my opsy shell name [15:14:05] only chasemp was there, i try to check it but often forget [15:14:23] ahh, so i added wrong reviewer a while ago [15:14:23] ok [15:14:33] (03PS2) 10Dzahn: screen-monitor: un-whitelist puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/381456 (https://phabricator.wikimedia.org/T165348) [15:14:42] wait, which one should I add? [15:14:47] when i want review? 
[15:14:48] heh, rush [15:15:11] (03CR) 10Dzahn: [C: 032] screen-monitor: un-whitelist puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/381456 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn) [15:19:08] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0 [15:22:22] (03PS1) 10Dzahn: screen-monitor: whitelist cluster::management [puppet] - 10https://gerrit.wikimedia.org/r/381461 (https://phabricator.wikimedia.org/T165348) [15:23:06] (03CR) 10Dzahn: [C: 032] screen-monitor: whitelist cluster::management [puppet] - 10https://gerrit.wikimedia.org/r/381461 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn) [15:23:37] (03PS1) 10Elukey: screen-monitor: whitelist analytics hosts [puppet] - 10https://gerrit.wikimedia.org/r/381462 (https://phabricator.wikimedia.org/T165348) [15:24:03] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3646693 (10Dzahn) >>! In T165348#3641893, @Volans wrote: > - `ms-fe1005` should be whitelisted until T162123 is done Done > - I don't think puppetmasters should be whitelisted... [15:25:33] 10Operations, 10Cloud-Services, 10wikitech.wikimedia.org, 10cloud-services-team (Kanban): Set up external DNS record for wikitech-static - https://phabricator.wikimedia.org/T164290#3646694 (10Andrew) a:05Andrew>03None [15:25:47] (03CR) 10Dzahn: "could we do this by role names instead of host names? it needs less edits in the future. 
Or alternatively by regex.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/381462 (https://phabricator.wikimedia.org/T165348) (owner: 10Elukey) [15:28:14] (03CR) 10Dzahn: "here was my previous suggestion that included these by roles https://gerrit.wikimedia.org/r/#/c/377823/7" [puppet] - 10https://gerrit.wikimedia.org/r/381462 (https://phabricator.wikimedia.org/T165348) (owner: 10Elukey) [15:30:07] (03CR) 10Elukey: "Sure, but I wanted to have a clear message about why that value was set (it changes between stat boxes and druid1003) :)" [puppet] - 10https://gerrit.wikimedia.org/r/381462 (https://phabricator.wikimedia.org/T165348) (owner: 10Elukey) [15:32:06] (03CR) 10Dzahn: "yea, the comments are good but isn't it still "1 role = 1 comment" ?" [puppet] - 10https://gerrit.wikimedia.org/r/381462 (https://phabricator.wikimedia.org/T165348) (owner: 10Elukey) [15:33:17] (03CR) 10Elukey: "For stat boxes it might be the case, but not for druid (since only one host has these temporary tmux sessions opened)" [puppet] - 10https://gerrit.wikimedia.org/r/381462 (https://phabricator.wikimedia.org/T165348) (owner: 10Elukey) [15:34:05] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [15:34:36] 10Operations, 10ops-eqiad: rack/setup/install flerovium.eqiad.wmnet - https://phabricator.wikimedia.org/T176505#3646721 (10Cmjohnson) [15:35:55] (03CR) 10Dzahn: "let's just whitelist all druid hosts then. i mean if druid hosts are used for long-running sessions any of them could be used in the futur" [puppet] - 10https://gerrit.wikimedia.org/r/381462 (https://phabricator.wikimedia.org/T165348) (owner: 10Elukey) [15:36:38] 10Operations, 10ops-eqiad: rack/setup/install flerovium.eqiad.wmnet - https://phabricator.wikimedia.org/T176505#3627849 (10Cmjohnson) Server is racked in d2, during setup I am unable to get a link from the disk shelves to the PERC Raid card. 
The card is being seen in the system bios and I am able to view the... [15:37:10] (03PS1) 10Elukey: hieradata::regex: remove mw130[89] from the whitelist appservers [puppet] - 10https://gerrit.wikimedia.org/r/381464 (https://phabricator.wikimedia.org/T165519) [15:37:44] (03CR) 10Elukey: [C: 032] hieradata::regex: remove mw130[89] from the whitelist appservers [puppet] - 10https://gerrit.wikimedia.org/r/381464 (https://phabricator.wikimedia.org/T165519) (owner: 10Elukey) [15:40:00] (03Abandoned) 10Elukey: screen-monitor: whitelist analytics hosts [puppet] - 10https://gerrit.wikimedia.org/r/381462 (https://phabricator.wikimedia.org/T165348) (owner: 10Elukey) [15:40:44] (03PS7) 10Jcrespo: mariadb: Implement regular logical backups using mydumper [puppet] - 10https://gerrit.wikimedia.org/r/374560 (https://phabricator.wikimedia.org/T169516) [15:41:27] mutante: argh I just realized that the stat* boxes have multiple roles assigned [15:41:43] in site.pp [15:42:07] yea, ideally they would just have a single role. but until then, just pick the one that is more likely to stay? [15:42:19] analytics_cluster::client ? [15:42:39] that's what i originally did here https://gerrit.wikimedia.org/r/#/c/377823/7 [15:42:45] druid::worker [15:42:53] analytics_cluster::coordinator [15:43:48] it's enough if one of them matches, so the "client" one covers 1004 and 1005 [15:44:32] but yea, role/profile conversion would mean just 1 role per node [15:45:43] or if you think the roles will not stay the same or there might be hosts without roles.. we can use regex.yaml at the bottom [15:46:12] maybe regex is better, I am afraid that with refactoring etc.. I'll make some confusion [15:46:42] see at the very bottom, i added a section there already that covers db/es [15:46:55] maybe just put the comments there.. 
should be just fine too [15:47:31] i still think it's slightly better than hostnames [15:47:37] and in one place to find [15:48:15] okok [15:52:52] !log re-imaging labvirt1015 in order to get it on the standard 4.4.0-81-generic kernel [15:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:14] (03CR) 10Jcrespo: [C: 032] mariadb: Implement regular logical backups using mydumper [puppet] - 10https://gerrit.wikimedia.org/r/374560 (https://phabricator.wikimedia.org/T169516) (owner: 10Jcrespo) [15:58:17] (03PS1) 10Elukey: screen-monitor: whitelist stat and druid hosts [puppet] - 10https://gerrit.wikimedia.org/r/381469 (https://phabricator.wikimedia.org/T165348) [15:58:58] mutante: better? --^ [15:59:35] (03CR) 10Dzahn: [C: 032] screen-monitor: whitelist stat and druid hosts [puppet] - 10https://gerrit.wikimedia.org/r/381469 (https://phabricator.wikimedia.org/T165348) (owner: 10Elukey) [15:59:37] yea :) [15:59:44] thanks [15:59:44] \o/ thanks! [16:00:14] they will disappear from Icinga after puppet ran on each of them .. 
and on icinga server [16:02:50] 10Operations, 10netops: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840#3646919 (10ayounsi) [16:06:20] (03PS1) 10Jcrespo: mariadb-backups: Fix require on cronjob [puppet] - 10https://gerrit.wikimedia.org/r/381472 (https://phabricator.wikimedia.org/T169516) [16:06:50] (03PS1) 10Dzahn: releases: add missing Jenkins proxy setup [puppet] - 10https://gerrit.wikimedia.org/r/381473 (https://phabricator.wikimedia.org/T164030) [16:07:14] (03CR) 10Marostegui: [C: 031] mariadb-backups: Fix require on cronjob [puppet] - 10https://gerrit.wikimedia.org/r/381472 (https://phabricator.wikimedia.org/T169516) (owner: 10Jcrespo) [16:09:24] (03PS2) 10Dzahn: releases: add missing Jenkins proxy setup [puppet] - 10https://gerrit.wikimedia.org/r/381473 (https://phabricator.wikimedia.org/T164030) [16:11:33] (03CR) 10Jcrespo: [C: 032] mariadb-backups: Fix require on cronjob [puppet] - 10https://gerrit.wikimedia.org/r/381472 (https://phabricator.wikimedia.org/T169516) (owner: 10Jcrespo) [16:11:50] (03PS1) 10Elukey: screen-monitor: remove druid from whitelisted hosts [puppet] - 10https://gerrit.wikimedia.org/r/381474 (https://phabricator.wikimedia.org/T165348) [16:12:32] (03CR) 10Elukey: [C: 032] screen-monitor: remove druid from whitelisted hosts [puppet] - 10https://gerrit.wikimedia.org/r/381474 (https://phabricator.wikimedia.org/T165348) (owner: 10Elukey) [16:12:41] (03PS2) 10Elukey: screen-monitor: remove druid from whitelisted hosts [puppet] - 10https://gerrit.wikimedia.org/r/381474 (https://phabricator.wikimedia.org/T165348) [16:13:11] oh :) [16:14:03] My team just got rid of them, better :) [16:14:23] jynus: feel free to merge anytime [16:14:31] I had already [16:14:59] ah weird I got yours in puppet-merge [16:15:02] okok running it now [16:15:21] there is an unfixed race condition there [16:29:18] 10Operations, 10netops: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840#3647025 (10ayounsi) [16:37:16] !log 
re-imaging labvirt1017, 1018 in order to get it on the standard 4.4.0-81-generic kernel [16:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:18] 10Operations, 10ops-eqiad: rack/setup/install flerovium.eqiad.wmnet - https://phabricator.wikimedia.org/T176505#3647061 (10Cmjohnson) [16:45:58] 10Operations, 10ops-eqiad: rack/setup/install flerovium.eqiad.wmnet - https://phabricator.wikimedia.org/T176505#3627849 (10Cmjohnson) a:05Cmjohnson>03RobH [16:46:42] 10Operations, 10ops-eqiad: rack/setup/install flerovium.eqiad.wmnet - https://phabricator.wikimedia.org/T176505#3627849 (10Cmjohnson) [16:57:24] 10Operations, 10Mail, 10Wikidata: Large number of "A page you created was linked on Wikidata" emails to one recipient in short period of time - https://phabricator.wikimedia.org/T177099#3647111 (10herron) [17:00:02] (03PS2) 10Rush: openstack: pdns fixup SOA default answer [puppet] - 10https://gerrit.wikimedia.org/r/381445 (https://phabricator.wikimedia.org/T171494) [17:00:04] !log kicking off extra job runners to process jobs moving commonswiki Category pages from general to content search index [17:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:09] (03PS1) 10RobH: flerovium.eqiad.wmnet install params [puppet] - 10https://gerrit.wikimedia.org/r/381479 (https://phabricator.wikimedia.org/T176505) [17:00:53] (03PS3) 10Rush: openstack: pdns fixup SOA default answer [puppet] - 10https://gerrit.wikimedia.org/r/381445 (https://phabricator.wikimedia.org/T171494) [17:01:10] 10Operations, 10Mail, 10Wikidata: Large number of "A page you created was linked on Wikidata" emails to one recipient in short period of time - https://phabricator.wikimedia.org/T177099#3647100 (10Lydia_Pintscher) My best guess: items being created for articles on Cebuano Wikipedia en masse. Most of these arti...
[17:01:15] (03CR) 10RobH: [C: 032] flerovium.eqiad.wmnet install params [puppet] - 10https://gerrit.wikimedia.org/r/381479 (https://phabricator.wikimedia.org/T176505) (owner: 10RobH) [17:03:12] (03PS7) 10Rush: openstack: pdns auth module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/381295 (https://phabricator.wikimedia.org/T171494) [17:11:36] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3647138 (10EBernhardson) > I think we might be able to add some capacity to processing those jobs on Monday, but we probably have either to re-think the... [17:12:11] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Icinga check for sysctl settings - https://phabricator.wikimedia.org/T160060#3647139 (10herron) Are there testing or canary servers that this check could be deployed to for testing on "real" systems? I've performed light testing in a vagrant box bu... [17:17:50] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler02/8095/" [puppet] - 10https://gerrit.wikimedia.org/r/376048 (https://phabricator.wikimedia.org/T109903) (owner: 10Herron) [17:19:04] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install flerovium.eqiad.wmnet - https://phabricator.wikimedia.org/T176505#3647154 (10RobH) a:05RobH>03faidon [17:20:09] 10Operations, 10ops-eqiad, 10Patch-For-Review: rack/setup/install flerovium.eqiad.wmnet - https://phabricator.wikimedia.org/T176505#3627849 (10RobH) This is installed with the puppet keys signed. Additionally, this has the two shelves cabled directly to the system independently, not in a daisy chain. I th... [17:22:30] PROBLEM - puppet last run on analytics1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:22:39] PROBLEM - puppet last run on radium is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [17:22:39] PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:22:40] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:22:49] PROBLEM - puppet last run on elastic1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:22:51] PROBLEM - puppet last run on labsdb1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:22:51] PROBLEM - puppet last run on wtp1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:22:59] PROBLEM - puppet last run on elastic1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:23:09] PROBLEM - puppet last run on labvirt1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:23:09] PROBLEM - puppet last run on elastic1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:23:09] PROBLEM - puppet last run on db1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:23:19] PROBLEM - puppet last run on wtp1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:23:19] PROBLEM - puppet last run on kafka1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:23:19] PROBLEM - puppet last run on db1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:23:19] PROBLEM - puppet last run on cp1071 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:23:30] PROBLEM - puppet last run on ms-be1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:23:39] PROBLEM - puppet last run on labvirt1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:28:46] Could not retrieve catalog from remote server: Error 400 on SERVER: Failed to submit 'replace facts' command for radium.wikimedia.org to PuppetDB at nitrogen.eqiad.wmnet:443: [502 Bad Gateway] [17:28:50] hmm [17:30:36] seems intermittent [17:30:40] RECOVERY - puppet last run on radium is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [17:31:03] ^ from a manual puppet agent -t [17:46:57] herron, Wikipedia is down for me. curl just hangs [17:48:10] matt_flaschen: what url? works for me [17:48:26] greg-g, https://en.wikipedia.org/wiki/Template:edit%20request , but all. [17:48:33] curl: (7) Failed to connect to en.wikipedia.org port 443: Connection timed out [17:48:37] matt_flaschen: traceroute would probably be useful [17:48:41] mtr or whatever [17:49:09] it is something on the network between you and the servers, most likely [17:49:33] greg-g, it seems to be Varnish not setting up the HTTPS correctly, but I'm not sure. [17:49:51] works for me. [17:50:05] Actually, the traceroute is indeed weird, though I don't know what it normally is. [17:50:59] RECOVERY - puppet last run on labsdb1011 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [17:51:01] Phabricator too. 
[17:51:08] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [17:51:09] RECOVERY - puppet last run on elastic1034 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [17:51:13] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Icinga check for sysctl settings - https://phabricator.wikimedia.org/T160060#3647254 (10Dzahn) You can use releases2001.codfw.wmnet (failover) ,iridium.eqiad.wmnet (to be decom'ed), osmium.eqiad.wmnet (spare), all "ms-*" machines using role::spare,... [17:51:19] RECOVERY - puppet last run on labvirt1016 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [17:51:25] matt_flaschen I am actually seeing the same thing as you on my main connection which is verizon [17:51:28] RECOVERY - puppet last run on elastic1036 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [17:51:38] RECOVERY - puppet last run on kafka1022 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [17:51:38] RECOVERY - puppet last run on cp1071 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [17:51:38] RECOVERY - puppet last run on db1056 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [17:51:44] herron, same, I'll call them as well, but my traceroute is https://pastebin.com/XnWTHv5Y . [17:51:48] RECOVERY - puppet last run on analytics1051 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [17:51:49] RECOVERY - puppet last run on labvirt1012 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [17:51:49] I'm on FiOS. 
[17:51:57] same here, fios [17:51:58] RECOVERY - puppet last run on rdb1002 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [17:51:58] RECOVERY - puppet last run on ms-be1032 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [17:51:59] RECOVERY - puppet last run on elastic1047 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [17:52:08] RECOVERY - puppet last run on db1020 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [17:52:08] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [17:52:22] site loads fine on backup connection [17:53:13] herron: re: canary/test servers for new monitoring, use anything that has role(spare::system) in site.pp basically [17:53:26] or releases2001 is also cool [17:53:37] mutante ok great thanks [17:56:58] RECOVERY - puppet last run on wtp1001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [17:59:30] (03CR) 10Paladox: [C: 031] releases: add missing Jenkins proxy setup [puppet] - 10https://gerrit.wikimedia.org/r/381473 (https://phabricator.wikimedia.org/T164030) (owner: 10Dzahn) [18:04:30] !log update catchpoint test for stream.wikimedia.org to watch https://stream.wikimedia.org/?doc [18:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:49] ACKNOWLEDGEMENT - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Telia Planned Work PWIC78418 [18:04:58] RoanKattouw will be deploying a couple reverts in a bit. It's OK. :) [18:05:16] Not related to Verizon, to fix broken icons in Internet Explorer [18:05:24] On the phone troubleshooting with them.
[18:06:26] !log remove grid-start-precise catchpoint check [18:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:59] !log remove Labs puppetmaster eqiad catchpoint check (for Toolforge) [18:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:40] !log remove labsdb1002 checks on catchpoint (Toolforge) [18:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:17] !log remove 'ToolLabs webservice - lighttpd on precise' catchpoint check for Catchpoint [18:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:24] Got them to escalate to network techs. [18:15:55] (03CR) 10Thcipriani: [C: 031] "I don't think there is a need for this anymore. I am not aware of anything/anyone that/who relies on this." [puppet] - 10https://gerrit.wikimedia.org/r/379502 (owner: 10Muehlenhoff) [18:19:31] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/8109/releases1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/381473 (https://phabricator.wikimedia.org/T164030) (owner: 10Dzahn) [18:19:55] !log catrope@tin Synchronized php-1.31.0-wmf.1/includes/libs/CSSMin.php: T176884 (duration: 00m 47s) [18:19:58] (03PS3) 10Dzahn: releases: add missing Jenkins proxy setup [puppet] - 10https://gerrit.wikimedia.org/r/381473 (https://phabricator.wikimedia.org/T164030) [18:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:00] T176884: Icons missing throughout UI on Edge, IE 11 - https://phabricator.wikimedia.org/T176884 [18:23:21] Back up [18:34:29] PROBLEM - Host labvirt1017 is DOWN: PING CRITICAL - Packet loss = 100% [18:34:48] still inaccessible for me... [18:35:18] RECOVERY - Host labvirt1017 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [18:36:02] dbrant, yeah, it went back down. 
[18:36:10] Calling back [18:42:10] RoanKattouw: just force merge around the phpcs-docker issues, I'll fix it in a bit, sorry about that [18:43:39] I did already [18:46:02] ok [18:55:44] Thanks for looking into it [18:57:18] (03PS1) 10Ottomata: [WIP] Prometheus based Kafka broker alerts, take 1 [puppet] - 10https://gerrit.wikimedia.org/r/381489 (https://phabricator.wikimedia.org/T175923) [18:57:54] herron, dbrant, it went back up before they brought in the network technicians this time. Is it working for you? [18:57:55] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Prometheus based Kafka broker alerts, take 1 [puppet] - 10https://gerrit.wikimedia.org/r/381489 (https://phabricator.wikimedia.org/T175923) (owner: 10Ottomata) [18:57:57] I'm hoping it sticks. [18:58:26] matt_flaschen: yep! back up for me. [18:58:49] thx! [18:58:52] Cool [19:00:17] (03PS2) 10Ottomata: [WIP] Prometheus based Kafka broker alerts, take 1 [puppet] - 10https://gerrit.wikimedia.org/r/381489 (https://phabricator.wikimedia.org/T175923) [19:00:46] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Prometheus based Kafka broker alerts, take 1 [puppet] - 10https://gerrit.wikimedia.org/r/381489 (https://phabricator.wikimedia.org/T175923) (owner: 10Ottomata) [19:03:42] (03PS3) 10Ottomata: [WIP] Prometheus based Kafka broker alerts, take 1 [puppet] - 10https://gerrit.wikimedia.org/r/381489 (https://phabricator.wikimedia.org/T175923) [19:04:14] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Prometheus based Kafka broker alerts, take 1 [puppet] - 10https://gerrit.wikimedia.org/r/381489 (https://phabricator.wikimedia.org/T175923) (owner: 10Ottomata) [19:05:32] (03PS4) 10Ottomata: [WIP] Prometheus based Kafka broker alerts, take 1 [puppet] - 10https://gerrit.wikimedia.org/r/381489 (https://phabricator.wikimedia.org/T175923) [19:07:20] working here as well :) [19:08:45] 10Operations, 10netops: Implement RPKI (Resource Public Key Infrastructure) - https://phabricator.wikimedia.org/T61115#3647470 (10BBlack) RFC 8205 (BGPSec) got published
this week, which will use RPKI to secure against bad route announcements by signing UPDATE messages - https://tools.ietf.org/html/rfc8205 [19:15:09] PROBLEM - MariaDB Slave Lag: s4 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 621.42 seconds [19:17:11] (03PS5) 10Ottomata: [WIP] Prometheus based Kafka broker alerts, take 1 [puppet] - 10https://gerrit.wikimedia.org/r/381489 (https://phabricator.wikimedia.org/T175923) [19:19:28] (03PS6) 10Ottomata: [WIP] Prometheus based Kafka broker alerts, take 1 [puppet] - 10https://gerrit.wikimedia.org/r/381489 (https://phabricator.wikimedia.org/T175923) [19:22:18] RECOVERY - MariaDB Slave Lag: s4 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 197.67 seconds [19:22:48] (03PS7) 10Ottomata: [WIP] Prometheus based Kafka broker alerts, take 1 [puppet] - 10https://gerrit.wikimedia.org/r/381489 (https://phabricator.wikimedia.org/T175923) [19:23:39] !log completed running cirrussearch category moves on commonswiki [19:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:13] (03PS8) 10Ottomata: Prometheus based Kafka broker alerts, take 1 [puppet] - 10https://gerrit.wikimedia.org/r/381489 (https://phabricator.wikimedia.org/T175923) [19:26:01] (03CR) 10Ottomata: "Luca, lemme know what you think. I moved the jmx exporter stuff into a ::monitoring class, and put icinga stuff in there too." 
[puppet] - 10https://gerrit.wikimedia.org/r/381489 (https://phabricator.wikimedia.org/T175923) (owner: 10Ottomata) [19:28:14] (03PS1) 10Jcrespo: mariadb: backup user must be dump to match already in use mysql account [puppet] - 10https://gerrit.wikimedia.org/r/381491 (https://phabricator.wikimedia.org/T169516) [19:29:13] (03PS9) 10Ottomata: Prometheus based Kafka broker alerts, take 1 [puppet] - 10https://gerrit.wikimedia.org/r/381489 (https://phabricator.wikimedia.org/T175923) [19:32:17] (03CR) 10Jcrespo: [C: 032] mariadb: backup user must be dump to match already in use mysql account [puppet] - 10https://gerrit.wikimedia.org/r/381491 (https://phabricator.wikimedia.org/T169516) (owner: 10Jcrespo) [19:37:58] 10Operations, 10Goal: Improve database backups' coverage, monitoring and data recovery time (part 1) (tracking) - https://phabricator.wikimedia.org/T169658#3647504 (10jcrespo) [19:38:00] 10Operations, 10DBA: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3647505 (10jcrespo) [19:43:49] 10Operations, 10DBA: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3647509 (10jcrespo) Part of the work (the skeleton) has been done T169516, but that is far from perfect, with more FIXMES than good code. That should be now fixed and coverted... [19:45:31] 10Operations, 10DBA: Create less overhead on bacula jobs when dumping production databases - https://phabricator.wikimedia.org/T162789#3647513 (10jcrespo) Clarification, the CC is not intended as a "you should do that", but as an apology for the current state, and that it will be done better soon. Thanks to ev...
[19:47:41] (03PS1) 10Nuria: Removing schema that no longer exists [puppet] - 10https://gerrit.wikimedia.org/r/381493 (https://phabricator.wikimedia.org/T171629) [19:47:48] (03PS2) 10Ottomata: Add LVS service for druid-analytics-broker [puppet] - 10https://gerrit.wikimedia.org/r/378956 (https://phabricator.wikimedia.org/T176223) [20:02:34] 10Operations, 10Fundraising-Backlog, 10fundraising-tech-ops: Port fundraising stats off Ganglia - https://phabricator.wikimedia.org/T152562#3647544 (10cwdent) 05Open>03Resolved Closing this, there are a few subtasks left but we have parity with ganglia [20:04:41] 10Operations, 10ops-ulsfo, 10Traffic, 10hardware-requests, 10Patch-For-Review: Decom cp4005-8,13-16 (8 nodes) - https://phabricator.wikimedia.org/T176366#3647562 (10BBlack) a:05BBlack>03RobH [20:05:50] 10Operations, 10ops-ulsfo, 10Traffic, 10hardware-requests, 10Patch-For-Review: Decom cp4005-8,13-16 (8 nodes) - https://phabricator.wikimedia.org/T176366#3622887 (10BBlack) @RobH - these are good to go for decom now. They're still booted, but have been depooled, removed from confd/lvs/etc, re-roled in p... [20:11:20] !log restbase2001 - closing screen sessions that weren't running anything anymore [20:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:57] !log install1002/install2002 - closing unused screen sessions (except a tail -f) [20:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:06] !log mwlog1001, oxygen - closing unused screen session [20:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:45] Sees more than 10 users with screen sessions on terbium. So it's common to manually run maintenance scripts. If you can think of any that could be replaced by puppet cron, let me know. 
[20:42:52] (03PS1) 10Dzahn: screen-monitor: whitelist mediawiki maintenance servers [puppet] - 10https://gerrit.wikimedia.org/r/381501 (https://phabricator.wikimedia.org/T165348) [20:44:35] (03CR) 10Dzahn: [C: 032] screen-monitor: whitelist mediawiki maintenance servers [puppet] - 10https://gerrit.wikimedia.org/r/381501 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn) [20:48:01] (03PS1) 10Dzahn: screen-monitor: raise WARN threshold to 24 hours [puppet] - 10https://gerrit.wikimedia.org/r/381502 (https://phabricator.wikimedia.org/T165348) [20:49:11] (03CR) 10Dzahn: [C: 032] screen-monitor: raise WARN threshold to 24 hours [puppet] - 10https://gerrit.wikimedia.org/r/381502 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn) [20:49:13] (03CR) 10Andrew Bogott: [C: 031] screen-monitor: raise WARN threshold to 24 hours [puppet] - 10https://gerrit.wikimedia.org/r/381502 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn) [21:03:31] (03PS1) 10Dzahn: screen-monitor: whitelist wdqs hosts [puppet] - 10https://gerrit.wikimedia.org/r/381504 (https://phabricator.wikimedia.org/T165348) [21:04:33] (03CR) 10Dzahn: [C: 032] screen-monitor: whitelist wdqs hosts [puppet] - 10https://gerrit.wikimedia.org/r/381504 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn) [21:06:10] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [21:19:08] !log labstore2003/2004: closed idle screen sessions [21:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:06] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3647733 (10Dzahn) >>! In T165348#3646509, @herron wrote: > Today there are ~20 unhandled screen/tmux problems in icinga. Maybe this number will decrease after handling the init... 
[21:30:59] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0 [21:35:23] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3647760 (10Dzahn) >>! In T165348#3641893, @Volans wrote: > - I don't think puppetmasters should be whitelisted You have one running there yourself, still need it ?:) There are... [21:36:21] andrewbogott: new cron spam type. "*** Not found: /api/queues/%2F/notifications.error [21:36:39] from labtestcontrol2003 [21:36:43] mutante: closed mine ;) [21:36:46] since today, once per hour [21:36:57] volans: heh, still online :) thanks [21:37:07] was an idle one used for auditing the remote ipmi in the fleet ;) [21:37:15] I really think we should allow idle ones [21:37:20] i'm down to just 4 hosts left [21:37:39] \o/ [21:40:06] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3647765 (10Volans) @Dzahn Closed mine, thanks for noticing. [21:41:54] mutante: I'll look. 
That host shouldn't be doing much [21:45:21] (03PS1) 10Andrew Bogott: rabbitmq: redirect cron errors to logfile to avoid cronspam [puppet] - 10https://gerrit.wikimedia.org/r/381505 [21:46:01] mutante: ^ should fix it, if it's what I think it is [21:46:30] (03CR) 10Andrew Bogott: [C: 032] rabbitmq: redirect cron errors to logfile to avoid cronspam [puppet] - 10https://gerrit.wikimedia.org/r/381505 (owner: 10Andrew Bogott) [21:46:50] andrewbogott: yes, +1 but was already closed :) thanks [21:46:58] looks very much like it [21:49:27] TIL, if you want to spy what is happening in another user's screen, and you don't want to mess with the tty permissions either, sudo su otheruser, run "script /dev/null", now you can join their screen :p [22:46:15] 10Operations, 10Gerrit, 10ORES, 10Scoring-platform-team, and 2 others: Support git-lfs files in gerrit - https://phabricator.wikimedia.org/T171758#3647927 (10greg) Adding our #releng-kanban project as we would like to work on this in the coming quarter or two (no promises though, this is not a "goal" only... [23:42:57] 10Operations, 10Wiki-Loves-Monuments (2017): Import Wiki Loves Monuments photos from Flickr to Commons - https://phabricator.wikimedia.org/T173056#3517417 (10LilyOfTheWest) @fgiunchedi a quick note that Multichill and I did some assessments of the number of photos we can transfer from Flickr to Commons as part...