[00:03:16] !log tried one more time on db2028,db2029, both trusty. on db2028: gmond was running as user ganglia-monitor, failed, had to manually kill the process, run puppet again then ok. on db2029, gmond was running as "499" but puppet just ran and removed it without manual intervention. (T177225)
[00:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:26] T177225: Uninstall ganglia from the fleet - https://phabricator.wikimedia.org/T177225
[00:18:38] (PS1) Dzahn: ganglia: add decom bash script if on trusty (WIP) [puppet] - https://gerrit.wikimedia.org/r/394727
[00:19:14] (CR) jerkins-bot: [V: -1] ganglia: add decom bash script if on trusty (WIP) [puppet] - https://gerrit.wikimedia.org/r/394727 (owner: Dzahn)
[00:20:27] (PS2) Dzahn: ganglia: add decom bash script if on trusty (WIP) [puppet] - https://gerrit.wikimedia.org/r/394727
[00:20:56] (CR) jerkins-bot: [V: -1] ganglia: add decom bash script if on trusty (WIP) [puppet] - https://gerrit.wikimedia.org/r/394727 (owner: Dzahn)
[00:22:41] (PS3) Dzahn: ganglia: add decom bash script if on trusty (WIP) [puppet] - https://gerrit.wikimedia.org/r/394727 (https://phabricator.wikimedia.org/T177225)
[00:26:08] PROBLEM - MariaDB Slave SQL: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table dewiki.pagelinks: Cant find record in pagelinks, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1070-bin.001560, end_log_pos 1001648006
[00:29:23] (PS1) Dzahn: site: convert "not really"-spare nodes to test nodes [puppet] - https://gerrit.wikimedia.org/r/394731
[00:31:05] (PS2) Dzahn: site: convert "not true"-spare system to role(test) [puppet] - https://gerrit.wikimedia.org/r/394731
[00:31:27] (PS3) Dzahn: site: convert "not true"-spare systems to role(test) [puppet] - https://gerrit.wikimedia.org/r/394731
[00:37:59]
PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 889.06 seconds
[00:39:30] (PS1) Dzahn: spare::systems: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/394733
[00:41:22] (CR) Dzahn: [C: 2] spare::systems: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/394733 (owner: Dzahn)
[00:50:12] (PS1) Dzahn: test servers: remove ganglia [puppet] - https://gerrit.wikimedia.org/r/394734
[00:51:05] (CR) Dzahn: [C: 2] "test and spare covers the entire "misc esams" as well" [puppet] - https://gerrit.wikimedia.org/r/394734 (owner: Dzahn)
[00:57:04] Operations, monitoring, Patch-For-Review: Uninstall ganglia from the fleet - https://phabricator.wikimedia.org/T177225#3804968 (Dzahn) db2023 - https://gerrit.wikimedia.org/r/394647 etcd::networking https://gerrit.wikimedia.org/r/394724 db2028,db2029 - https://gerrit.wikimedia.org/r/394725 spare syst...
[01:14:19] RECOVERY - MariaDB Slave SQL: s5 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[01:21:09] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 2.38 seconds
[02:34:34] (PS17) TerraCodes: $wmf* -> $wmg* [mediawiki-config] - https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956)
[03:24:29] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 781.89 seconds
[03:51:21] !log Ran "scap pull" on snapshot1001, after final T181385 tests
[03:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:51:32] T181385: Wikidata entity dumpers stuck with 100% CPU on snapshot1007 - https://phabricator.wikimedia.org/T181385
[04:02:39] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 256.61 seconds
[05:05:31] PROBLEM - Debian mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/debian is over 14 hours old.
[06:29:02] PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/varnishreqstats]
[06:29:21] PROBLEM - puppet last run on mw2229 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/systemd/system/nginx.service.d/security.conf]
[06:32:01] PROBLEM - puppet last run on db2090 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled]
[06:32:05] PROBLEM - puppet last run on mw2173 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/hphpd/hphpd.ini]
[06:57:01] RECOVERY - puppet last run on db2090 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:02] RECOVERY - puppet last run on mw2173 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:01] RECOVERY - puppet last run on cp2024 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:21] RECOVERY - puppet last run on mw2229 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:01:31] PROBLEM - Check systemd state on mw1259 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:14:31] PROBLEM - puppet last run on mw1259 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures.
Failed resources (up to 3 shown): Service[jobchron]
[08:17:51] PROBLEM - Check Varnish expiry mailbox lag on cp4026 is CRITICAL: CRITICAL: expiry mailbox lag is 2021057
[08:57:51] PROBLEM - Check Varnish expiry mailbox lag on cp4026 is CRITICAL: CRITICAL: expiry mailbox lag is 2060952
[09:01:40] Operations, Wikimedia-Blog, Patch-For-Review: add techblog.wikimedia.org redirection to blog.wikimedia.org to redirects - https://phabricator.wikimedia.org/T90638#1063823 (Framawiki) See {T181878} too
[09:08:55] (PS1) Framawiki: Redirect techblog.wikimedia.org to blog.wikimedia.org/c/technology [puppet] - https://gerrit.wikimedia.org/r/394743 (https://phabricator.wikimedia.org/T181878)
[10:37:53] RECOVERY - Check Varnish expiry mailbox lag on cp4026 is OK: OK: expiry mailbox lag is 11
[11:40:30] (CR) Zoranzoki21: [C: 1] Redirect techblog.wikimedia.org to blog.wikimedia.org/c/technology [puppet] - https://gerrit.wikimedia.org/r/394743 (https://phabricator.wikimedia.org/T181878) (owner: Framawiki)
[11:47:29] (PS2) Zoranzoki21: Remove mysql module from WMF [puppet] - https://gerrit.wikimedia.org/r/391849 (https://phabricator.wikimedia.org/T162070) (owner: Jcrespo)
[14:03:30] Puppet, Wikimedia-Apache-configuration, Wikimedia-Language-setup, Wiki-Setup (Close): Redirect several wikis - https://phabricator.wikimedia.org/T169450#3805576 (MarcoAurelio)
[14:44:07] (PS3) ArielGlenn: move one more setting out of snapshot hiera and into profiles [puppet] - https://gerrit.wikimedia.org/r/393546
[14:49:14] (PS4) ArielGlenn: move one more setting out of snapshot hiera and into profiles [puppet] - https://gerrit.wikimedia.org/r/393546
[14:50:16] (Draft1) MarcoAurelio: gerritbot: bolden `merged` on Phabricator comments [puppet] - https://gerrit.wikimedia.org/r/394748 (https://phabricator.wikimedia.org/T181886)
[14:50:21] (PS2) MarcoAurelio: gerritbot: bolden `merged` on Phabricator comments [puppet] -
https://gerrit.wikimedia.org/r/394748 (https://phabricator.wikimedia.org/T181886)
[14:54:42] RECOVERY - Debian mirror in sync with upstream on sodium is OK: /srv/mirrors/debian is over 0 hours old.
[14:59:43] (PS5) ArielGlenn: move one more setting out of snapshot hiera and into profiles [puppet] - https://gerrit.wikimedia.org/r/393546
[15:03:03] (CR) ArielGlenn: [C: 2] move one more setting out of snapshot hiera and into profiles [puppet] - https://gerrit.wikimedia.org/r/393546 (owner: ArielGlenn)
[15:07:29] (PS4) ArielGlenn: move last hiera calls out of snapshot modules into profile [puppet] - https://gerrit.wikimedia.org/r/393547
[15:11:15] (CR) ArielGlenn: [C: 2] move last hiera calls out of snapshot modules into profile [puppet] - https://gerrit.wikimedia.org/r/393547 (owner: ArielGlenn)
[15:36:27] (CR) Framawiki: [C: 1] gerritbot: bolden `merged` on Phabricator comments [puppet] - https://gerrit.wikimedia.org/r/394748 (https://phabricator.wikimedia.org/T181886) (owner: MarcoAurelio)
[15:50:31] (PS10) ArielGlenn: clean up old misc dump output files from cron jobs on dump hosts [puppet] - https://gerrit.wikimedia.org/r/393245 (https://phabricator.wikimedia.org/T179942)
[16:24:48] (PS1) ArielGlenn: Wikidata weekly json and rdf dumps disabled temporarily [puppet] - https://gerrit.wikimedia.org/r/394758 (https://phabricator.wikimedia.org/T181385)
[16:26:05] (CR) Hoo man: [C: 1] Wikidata weekly json and rdf dumps disabled temporarily [puppet] - https://gerrit.wikimedia.org/r/394758 (https://phabricator.wikimedia.org/T181385) (owner: ArielGlenn)
[16:26:29] (CR) ArielGlenn: [C: 2] Wikidata weekly json and rdf dumps disabled temporarily [puppet] - https://gerrit.wikimedia.org/r/394758 (https://phabricator.wikimedia.org/T181385) (owner: ArielGlenn)
[16:31:12] Operations, Gerrit, ORES, Scoring-platform-team, and 3 others: Plan migration of ORES repos to git-lfs -
https://phabricator.wikimedia.org/T181678#3805805 (Halfak) I've de-converted all of our github repos so that we can continue work while we wait for {T180628}
[16:41:12] Operations, Gerrit, ORES, Scoring-platform-team, and 3 others: Plan migration of ORES repos to git-lfs - https://phabricator.wikimedia.org/T181678#3805812 (Halfak) I've re-enabled observation on: https://phabricator.wikimedia.org/source/editquality https://phabricator.wikimedia.org/source/draftqu...
[17:00:52] (PS9) Ema: varnish: prometheus equivalent of statsd metrics daemons [puppet] - https://gerrit.wikimedia.org/r/394543 (https://phabricator.wikimedia.org/T177199)
[17:01:41] (PS2) Ema: mtail: add varnishmtail tests [puppet] - https://gerrit.wikimedia.org/r/394597 (https://phabricator.wikimedia.org/T177199)
[17:03:55] (PS11) ArielGlenn: clean up old misc dump output files from cron jobs on dump hosts [puppet] - https://gerrit.wikimedia.org/r/393245 (https://phabricator.wikimedia.org/T179942)
[17:05:04] (PS12) ArielGlenn: clean up old misc dump output files from cron jobs on dump hosts [puppet] - https://gerrit.wikimedia.org/r/393245 (https://phabricator.wikimedia.org/T179942)
[17:10:07] (CR) ArielGlenn: [C: 2] clean up old misc dump output files from cron jobs on dump hosts [puppet] - https://gerrit.wikimedia.org/r/393245 (https://phabricator.wikimedia.org/T179942) (owner: ArielGlenn)
[17:24:29] (PS10) Ema: varnish: prometheus equivalent of statsd metrics daemons [puppet] - https://gerrit.wikimedia.org/r/394543 (https://phabricator.wikimedia.org/T177199)
[17:55:52] !log Reboot db1096.s5 to pick up the correct innodb_buffer_pool size after finishing compressing s5 - T178359
[17:56:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:56:03] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[18:20:06] (PS1) ArielGlenn: simplify cleanup of old xml/sql dumps [puppet] -
https://gerrit.wikimedia.org/r/394763 (https://phabricator.wikimedia.org/T181895)
[18:22:21] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:25:47] :/
[18:30:52] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0
[18:51:47] huh
[18:52:05] sigh
[18:52:21] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[18:53:27] changed nothing, it was the random hiccup that happens on hosts from time to time
[18:53:35] hoo: ^
[18:54:23] Ah ok… I thought it might be related to the cron changes :P
[18:54:35] nope
[18:54:59] remember I said I had to run puppet for those, manually remove the cron jobs, and run puppet again to make sure it didn't add them back
[18:55:25] so nope
[18:55:46] the other changesets are for cleanup of misc crons on the nfs/web servers
[18:55:48] not the snapshots
[18:56:12] and I ran a couple of those too (besides the usual puppet compiler runs, which don't get everything)
[19:17:38] (PS2) ArielGlenn: simplify cleanup of old xml/sql dumps [puppet] - https://gerrit.wikimedia.org/r/394763 (https://phabricator.wikimedia.org/T181895)
[19:37:01] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[19:39:00] Operations, Cloud-Services, Patch-For-Review, cloud-services-team (Kanban): Recover "Flominator" svn account for use as a modern developer account - https://phabricator.wikimedia.org/T180813#3805996 (Flominator) @bd808 Worked like a charm, thanks a lot. Now I only have to understand how all of th...
[19:42:32] PROBLEM - HHVM rendering on mw2123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:43:32] RECOVERY - HHVM rendering on mw2123 is OK: HTTP OK: HTTP/1.1 200 OK - 73582 bytes in 0.301 second response time
[19:47:12] PROBLEM - High lag on wdqs1003 is CRITICAL: CRITICAL: 31.03% of data above the critical threshold [1800.0] https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[20:15:36] (PS1) Jon Harald Søby: Translate name of Wiktionary in Wallon [mediawiki-config] - https://gerrit.wikimedia.org/r/394771 (https://phabricator.wikimedia.org/T181782)
[20:24:20] (PS2) Framawiki: Localize sitename and meta NS for wawiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/394771 (https://phabricator.wikimedia.org/T181782) (owner: Jon Harald Søby)
[20:55:32] RECOVERY - High lag on wdqs1003 is OK: OK: Less than 30.00% above the threshold [600.0] https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[20:59:39] (PS3) ArielGlenn: simplify cleanup of old xml/sql dumps [puppet] - https://gerrit.wikimedia.org/r/394763 (https://phabricator.wikimedia.org/T181895)
[21:28:42] AaronSchulz: thanks for picking up the statsd buffer issue so quickly
[22:12:49] (CR) MarcoAurelio: "I think, from what Reedy told me with regards to another patch I uploaded, that you now need to run a script to convert that .dat data int" [puppet] - https://gerrit.wikimedia.org/r/394743 (https://phabricator.wikimedia.org/T181878) (owner: Framawiki)
[22:13:17] Reedy, ^^
[22:13:21] is that right?
[22:13:45] compile_redirects.rb I think it was called
[22:58:35] (CR) Reedy: [C: -1] "Yup" [puppet] - https://gerrit.wikimedia.org/r/394743 (https://phabricator.wikimedia.org/T181878) (owner: Framawiki)