[00:00:34] RECOVERY - Filesystem available is greater than filesystem size on ms-be1041 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be1041&var-datasource=eqiad%2520prometheus%252Fops [00:01:04] RECOVERY - Tor DirPort on torrelay1001 is OK: TCP OK - 0.000 second response time on 208.80.154.9 port 9032 [00:02:06] (03PS1) 10Alex Monk: Change secure opener mode to 640 [software/certcentral] - 10https://gerrit.wikimedia.org/r/458933 [00:03:24] PROBLEM - Check systemd state on torrelay1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [00:03:30] (03CR) 10jerkins-bot: [V: 04-1] Change secure opener mode to 640 [software/certcentral] - 10https://gerrit.wikimedia.org/r/458933 (owner: 10Alex Monk) [00:03:36] (03CR) 10Alex Monk: "Some relevant discussion at https://phabricator.wikimedia.org/T199711#4551409" [software/certcentral] - 10https://gerrit.wikimedia.org/r/458933 (owner: 10Alex Monk) [00:04:14] 10Operations: Trying to install updated versions of "linux-meta linux-meta-4.9" fails - https://phabricator.wikimedia.org/T203851 (10Paladox) This is on stretch. [00:04:34] RECOVERY - Check systemd state on torrelay1001 is OK: OK - running: The system is fully operational [00:07:26] (03CR) 10Dzahn: [C: 032] replace radium with torrelay1001 as tor-eqiad-1.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/458839 (https://phabricator.wikimedia.org/T196701) (owner: 10Dzahn) [00:07:53] PROBLEM - Check systemd state on torrelay1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [00:08:03] (03CR) 10Dzahn: [C: 032] "i see traffic on torrelay with "arm" and using the hashed controller password.. servics up. looks good" [dns] - 10https://gerrit.wikimedia.org/r/458839 (https://phabricator.wikimedia.org/T196701) (owner: 10Dzahn) [00:08:17] (03PS2) 10Alex Monk: Change secure opener mode to 640 [software/certcentral] - 10https://gerrit.wikimedia.org/r/458933 [00:08:30] (03PS1) 10Legoktm: planet: Add Wikimedia Security Team blog to en [puppet] - 10https://gerrit.wikimedia.org/r/458935 [00:11:04] RECOVERY - Check systemd state on torrelay1001 is OK: OK - running: The system is fully operational [00:13:54] (03PS1) 10Alex Monk: api: Make OSE a lot louder [software/certcentral] - 10https://gerrit.wikimedia.org/r/458936 [00:16:30] !log tor relay switched over from radium to torrelay1001, fixed /var/lib/tor permissions, restarted service, flipped DNS CNAME (5M TTL), traffic can be seen with "arm", monitoring all green (T196701) [00:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:36] T196701: rack/setup/install torrelay1001.wikimedia.org - https://phabricator.wikimedia.org/T196701 [00:17:02] (03PS1) 10Alex Monk: Push out all chaining types for initial certs [software/certcentral] - 10https://gerrit.wikimedia.org/r/458939 [00:18:24] (03CR) 10jerkins-bot: [V: 04-1] Push out all chaining types for initial certs [software/certcentral] - 10https://gerrit.wikimedia.org/r/458939 (owner: 10Alex Monk) [00:18:37] !log to watch what is happenin on torrelay1001 - sudo -u debian-tor arm - if asked for password it's in passwords::tor in private [00:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:18] 10Operations, 10Patch-For-Review, 10Tor: rack/setup/install torrelay1001.wikimedia.org - https://phabricator.wikimedia.org/T196701 (10Dzahn) [00:24:33] RECOVERY - Disk space on elastic1024 is OK: 
DISK OK [00:27:50] !log torrelay1001 - reset internal state (sighup) with "arm" and pressing x twice [00:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:28:30] (03PS1) 10Catrope: Revert "labs: Set wgChangeTagsSchemaMigrationStage to MIGRATION_NEW" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458942 [00:28:50] (03PS2) 10Catrope: Revert "labs: Set wgChangeTagsSchemaMigrationStage to MIGRATION_NEW" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458942 (https://phabricator.wikimedia.org/T196671) [00:28:59] (03CR) 10Catrope: [C: 032] Revert "labs: Set wgChangeTagsSchemaMigrationStage to MIGRATION_NEW" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458942 (https://phabricator.wikimedia.org/T196671) (owner: 10Catrope) [00:30:15] (03Merged) 10jenkins-bot: Revert "labs: Set wgChangeTagsSchemaMigrationStage to MIGRATION_NEW" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458942 (https://phabricator.wikimedia.org/T196671) (owner: 10Catrope) [00:34:20] (03PS2) 10Alex Monk: Push out all chaining types for initial certs [software/certcentral] - 10https://gerrit.wikimedia.org/r/458939 [00:36:02] (03CR) 10jerkins-bot: [V: 04-1] Push out all chaining types for initial certs [software/certcentral] - 10https://gerrit.wikimedia.org/r/458939 (owner: 10Alex Monk) [00:39:10] (03CR) 10Dzahn: [C: 032] planet: Add Wikimedia Security Team blog to en [puppet] - 10https://gerrit.wikimedia.org/r/458935 (owner: 10Legoktm) [00:39:53] (03PS2) 10Dzahn: Revert "tor_relay: temp allow rsync of datadir for migration" [puppet] - 10https://gerrit.wikimedia.org/r/456049 [00:40:09] (03CR) 10Dzahn: [C: 032] Revert "tor_relay: temp allow rsync of datadir for migration" [puppet] - 10https://gerrit.wikimedia.org/r/456049 (owner: 10Dzahn) [00:46:19] (03PS1) 10Dzahn: tor: enable logging at 'notice' level (recommended) [puppet] - 10https://gerrit.wikimedia.org/r/458944 (https://phabricator.wikimedia.org/T196701) [00:46:48] (03PS3) 10Alex Monk: Push out all chaining types for initial certs [software/certcentral] - 10https://gerrit.wikimedia.org/r/458939 [00:47:52] (03PS2) 10Dzahn: tor: enable logging at 'notice' level (recommended) [puppet] - 10https://gerrit.wikimedia.org/r/458944 (https://phabricator.wikimedia.org/T196701) [00:48:24] (03CR) 10Dzahn: [C: 032] tor: enable logging at 'notice' level (recommended) [puppet] - 10https://gerrit.wikimedia.org/r/458944 (https://phabricator.wikimedia.org/T196701) (owner: 10Dzahn) [00:48:30] (03CR) 10jerkins-bot: [V: 04-1] Push out all chaining types for initial certs [software/certcentral] - 10https://gerrit.wikimedia.org/r/458939 (owner: 10Alex Monk) [00:52:43] PROBLEM - Check systemd state on torrelay1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
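The certcentral patch discussed above (gerrit 458933, "Change secure opener mode to 640") is not quoted in the log, but the idea is a standard Python pattern: pass a custom opener to open() so new files are created with restrictive permissions. A minimal sketch, assuming a typical opener-based implementation; the function name and example path are illustrative, not taken from certcentral:

```python
import os

def secure_opener(path, flags):
    """Create files with mode 0o640 (owner rw, group r, others none).

    The mode only applies when the file is created and is still
    masked by the process umask.
    """
    return os.open(path, flags, mode=0o640)

# Illustrative usage: write a file that group members may read but
# other users may not.
with open("example.key", "w", opener=secure_opener) as f:
    f.write("dummy contents\n")
```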
[00:52:58] (03PS4) 10Alex Monk: Push out all chaining types for initial certs [software/certcentral] - 10https://gerrit.wikimedia.org/r/458939 [00:53:29] !log radium - stopping rsync.service [00:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:53:36] (03CR) 10jenkins-bot: Revert "labs: Set wgChangeTagsSchemaMigrationStage to MIGRATION_NEW" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/458942 (https://phabricator.wikimedia.org/T196671) (owner: 10Catrope) [00:54:12] (03CR) 10jerkins-bot: [V: 04-1] Push out all chaining types for initial certs [software/certcentral] - 10https://gerrit.wikimedia.org/r/458939 (owner: 10Alex Monk) [00:57:23] 10Operations, 10Tor: decom radium - https://phabricator.wikimedia.org/T203861 (10Dzahn) p:05Triage>03Normal [00:57:50] 10Operations, 10Tor: decom radium - https://phabricator.wikimedia.org/T203861 (10Dzahn) [00:59:31] (03PS1) 10Dzahn: site: turn radium into a spare system [puppet] - 10https://gerrit.wikimedia.org/r/458946 (https://phabricator.wikimedia.org/T196701) [01:00:28] 10Operations, 10Patch-For-Review, 10Tor: decom radium - https://phabricator.wikimedia.org/T203861 (10Dzahn) turning it into a spare::system already to remove unused Icinga monitoring, stop the rsync service via puppet etc [01:00:47] 10Operations, 10Patch-For-Review, 10Tor: rack/setup/install torrelay1001.wikimedia.org - https://phabricator.wikimedia.org/T196701 (10Dzahn) [01:00:49] 10Operations, 10Patch-For-Review, 10Tor: decom radium - https://phabricator.wikimedia.org/T203861 (10Dzahn) 05Open>03stalled [01:01:22] (03PS2) 10Dzahn: site: turn radium into a spare system [puppet] - 10https://gerrit.wikimedia.org/r/458946 (https://phabricator.wikimedia.org/T196701) [01:02:23] (03CR) 10Dzahn: [C: 032] site: turn radium into a spare system [puppet] - 10https://gerrit.wikimedia.org/r/458946 (https://phabricator.wikimedia.org/T196701) (owner: 10Dzahn) [01:02:25] (03PS5) 10Alex Monk: Push out all chaining types for initial certs [software/certcentral] - 10https://gerrit.wikimedia.org/r/458939 [01:03:35] (03CR) 10jerkins-bot: [V: 04-1] Push out all chaining types for initial certs [software/certcentral] - 10https://gerrit.wikimedia.org/r/458939 (owner: 10Alex Monk) [01:07:46] 10Operations, 10LDAP, 10Security: Have a check to prevent non-existent accounts from being added to LDAP groups - https://phabricator.wikimedia.org/T201779 (10Legoktm) @MoritzMuehlenhoff I think this needs to be something integrated into whatever tool is being used to add people to LDAP rather than something... [01:10:35] !log also rsyncing /var/lib/tor-instances/ data for second instance and restarting service (T196701) [01:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:10:41] T196701: rack/setup/install torrelay1001.wikimedia.org - https://phabricator.wikimedia.org/T196701 [01:29:23] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [01:31:45] (03PS1) 10Dzahn: Revert "tor: enable logging at 'notice' level (recommended)" [puppet] - 10https://gerrit.wikimedia.org/r/458955 [01:32:10] (03CR) 10Dzahn: "this causes a conflict when the secondary instance wants to write to the same logfile as the primary one..." 
[puppet] - 10https://gerrit.wikimedia.org/r/458955 (owner: 10Dzahn) [01:32:20] (03PS2) 10Dzahn: Revert "tor: enable logging at 'notice' level (recommended)" [puppet] - 10https://gerrit.wikimedia.org/r/458955 [01:32:26] (03CR) 10Dzahn: [C: 032] Revert "tor: enable logging at 'notice' level (recommended)" [puppet] - 10https://gerrit.wikimedia.org/r/458955 (owner: 10Dzahn) [01:34:03] RECOVERY - Check systemd state on torrelay1001 is OK: OK - running: The system is fully operational [01:34:24] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [01:46:44] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [01:51:53] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [01:55:14] 10Operations, 10Patch-For-Review, 10Tor: rack/setup/install torrelay1001.wikimedia.org - https://phabricator.wikimedia.org/T196701 (10Dzahn) https://metrics.torproject.org/rs.html#details/DB19E709C9EDB903F75F2E6CA95C84D637B62A02 [01:59:18] 10Operations, 10Patch-For-Review, 10Tor: rack/setup/install torrelay1001.wikimedia.org - https://phabricator.wikimedia.org/T196701 (10Dzahn) wikimedia-eqiad1 is fine. unfortunately the secondary one, wikimediaeqiad2 got started with a different fingerprint at first (synced /var/lib/tor but not /var/lib/to... [01:59:40] 10Operations, 10Patch-For-Review, 10Tor: rack/setup/install torrelay1001.wikimedia.org - https://phabricator.wikimedia.org/T196701 (10Dzahn) 05Open>03Resolved [02:03:18] (03PS8) 10Alex Monk: Debian packaging [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/458554 [02:09:15] (03CR) 10Alex Monk: [C: 04-1] "needs an argument for directory URL" [software/certcentral] - 10https://gerrit.wikimedia.org/r/457933 (owner: 10Alex Monk) [02:12:04] (03PS34) 10Alex Monk: [WIP] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) [02:13:39] 10Operations, 10LDAP, 10Security: Have a check to prevent non-existent accounts from being added to LDAP groups - https://phabricator.wikimedia.org/T201779 (10Dzahn) This lasted mere seconds. Shouldn't have logged it before confirming. [02:20:34] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [02:21:21] (03CR) 10Dzahn: "using pip instead of APT conflicts with https://phabricator.wikimedia.org/L3 so this could not be used in production. whether this is app" [puppet] - 10https://gerrit.wikimedia.org/r/451698 (https://phabricator.wikimedia.org/T192698) (owner: 10Zhuyifei1999) [02:23:20] (03CR) 10Dzahn: [C: 04-1] "i dont think "exec pip" from puppet is a good pattern. what makes you want that instead of using normal packages" [puppet] - 10https://gerrit.wikimedia.org/r/451698 (https://phabricator.wikimedia.org/T192698) (owner: 10Zhuyifei1999) [02:25:02] (03CR) 10Dzahn: [C: 04-1] "if pip is really needed for some reason then there is a puppet pip provider. that would still be better than exec. 
https://puppet.com/docs" [puppet] - 10https://gerrit.wikimedia.org/r/451698 (https://phabricator.wikimedia.org/T192698) (owner: 10Zhuyifei1999) [02:25:43] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [02:32:57] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [02:42:57] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 17 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [02:59:47] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:00:17] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [03:00:37] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [03:01:48] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [03:08:48] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [03:09:17] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [03:10:28] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [03:14:32] 10Operations, 10ops-eqiad, 10Analytics: rack/setup/install stat1007.eqiad.wmnet (stat1005 user replacement) - https://phabricator.wikimedia.org/T203852 (10Peachey88) [03:15:38] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [03:26:17] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 808.29 seconds [03:43:08] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [03:50:07] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 274.16 seconds [03:53:18] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [04:05:20] (03CR) 10Zhuyifei1999: "> i dont think "exec pip" from puppet is a good pattern." 
[puppet] - 10https://gerrit.wikimedia.org/r/451698 (https://phabricator.wikimedia.org/T192698) (owner: 10Zhuyifei1999) [04:05:38] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [04:06:19] (03PS1) 10Krinkle: profiler: Move vendor/autoload from MWMultiVersion to profiler.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459013 [04:09:29] (03PS2) 10Krinkle: profiler: Move vendor/autoload from MWMultiVersion to profiler.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459013 (https://phabricator.wikimedia.org/T189966) [04:10:47] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [04:16:08] (03CR) 10Krinkle: [C: 032] profiler: Move vendor/autoload from MWMultiVersion to profiler.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459013 (https://phabricator.wikimedia.org/T189966) (owner: 10Krinkle) [04:16:51] !log krinkle@deploy1001 Synchronized wmf-config/profiler.php: Ia27a8f7ed612f (duration: 00m 54s) [04:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:17:18] (03Merged) 10jenkins-bot: profiler: Move vendor/autoload from MWMultiVersion to profiler.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459013 (https://phabricator.wikimedia.org/T189966) (owner: 10Krinkle) [04:17:48] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [04:20:26] (03CR) 10Zhuyifei1999: "I think for a production service like ORES they use extensive deployment systems like scap invoking wheel builds in order to have self-bui" [puppet] - 10https://gerrit.wikimedia.org/r/451698 (https://phabricator.wikimedia.org/T192698) (owner: 10Zhuyifei1999) [04:22:36] !log krinkle@deploy1001 Synchronized multiversion/: Ia27a8f7ed612f (duration: 00m 49s) [04:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:22:57] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [04:23:32] (03CR) 10jenkins-bot: profiler: Move vendor/autoload from MWMultiVersion to profiler.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/459013 (https://phabricator.wikimedia.org/T189966) (owner: 10Krinkle) [04:35:18] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [04:42:07] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:42:17] PROBLEM - HTTP availability for Varnish at eqsin on einsteinium is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [04:42:48] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [04:44:38] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 
11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [04:45:18] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [04:45:18] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [04:45:28] RECOVERY - HTTP availability for Varnish at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [04:51:07] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [04:51:28] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [05:07:47] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [05:12:47] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 17 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [05:34:29] (03CR) 10Zhuyifei1999: "Regarding the pip provider not supporting venvs, there are https://projects.puppetlabs.com/issues/7286 => https://tickets.puppetlabs.com/b" [puppet] - 10https://gerrit.wikimedia.org/r/451698 (https://phabricator.wikimedia.org/T192698) (owner: 10Zhuyifei1999) [06:29:58] PROBLEM - puppet last run on analytics1071 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/10-puppet-agent.conf] [06:30:48] PROBLEM - puppet last run on wdqs1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [06:31:28] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [06:32:27] PROBLEM - puppet last run on ganeti1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/field.sh] [06:51:57] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 23 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [06:55:18] RECOVERY - puppet last run on analytics1071 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:56:18] RECOVERY - puppet last run on wdqs1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:47] RECOVERY - puppet last run on ganeti1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:01:58] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:16:28] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [07:26:38] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [07:33:58] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [07:39:08] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [08:36:27] (03CR) 10Gehel: [C: 04-1] "Looks good! Minor comments inline" (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/458891 (owner: 10Mathew.onipe) [08:57:28] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [09:07:38] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 17 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [09:24:17] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.1481 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [09:28:38] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [09:30:57] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [09:30:58] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0.0462 https://grafana.wikimedia.org/dashboard/db/logstash [09:45:08] !log tools restarted cron and truncated /var/log/exim4/paniclog (T196137) [09:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:17] T196137: toolforge: prometheus issue is filling up email queue - https://phabricator.wikimedia.org/T196137 [09:45:18] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [09:50:28] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 17 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [10:28:42] gtirloni: I think you were meant to log that in #wikimedia-cloud unless you were told to do it in here in which case ignore me :) [10:30:33] paladox: yeah, I think that makes sense, thanks :) [10:31:06] You're welcome :) [10:36:44] 10Operations, 10ops-eqiad, 10Analytics: rack/setup/install
stat1007.eqiad.wmnet (stat1005 user replacement) - https://phabricator.wikimedia.org/T203852 (10elukey) I like the racking proposal! Please also remember that it needs to be put into the Analytics VLAN :) [10:44:28] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 21 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [10:45:57] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:50:08] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [10:52:28] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:54:37] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [10:56:28] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [10:58:38] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [11:04:47] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 17 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [11:38:27] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:41:38] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [13:53:42] (03CR) 10Framawiki: ">using pip instead of APT conflicts with https://phabricator.wikimedia.org/L3 so this could not be used in production. 
whether this is ap" [puppet] - 10https://gerrit.wikimedia.org/r/451698 (https://phabricator.wikimedia.org/T192698) (owner: 10Zhuyifei1999) [14:36:28] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [14:41:37] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [15:25:28] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [15:30:38] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 17 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [15:54:08] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 746.97 seconds [16:03:58] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 266.50 seconds [16:28:11] 10Operations, 10Mail: Implement MTA-STS - https://phabricator.wikimedia.org/T203883 (10faidon) p:05Triage>03Normal [17:09:48] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [17:14:57] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 17 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [18:07:58] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [18:12:58] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [18:17:39] (03CR) 10Krinkle: [C: 031] mediawiki::web::prod_sites: enable HHVM on some sites(!!!) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/452325 (owner: 10Giuseppe Lavagetto) [18:20:17] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [18:20:47] (03CR) 10Krinkle: mediawiki::web::prod_sites: convert usability wiki (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/452635 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [18:25:18] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [19:04:58] PROBLEM - pdfrender on scb2004 is CRITICAL: connect to address 10.192.16.36 and port 5252: Connection refused [19:08:07] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [19:13:08] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 17 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [19:36:07] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:38:17] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:59:18] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [20:09:28] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 18 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [20:47:17] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [20:52:18] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 17 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [21:04:38] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [21:09:47] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [21:22:08] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 20 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [21:27:08] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 19 probes of 319 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [22:05:38] PROBLEM - Filesystem available is greater than filesystem size on ms-be2043 is CRITICAL: cluster=swift device=/dev/sdd1 fstype=xfs instance=ms-be2043:9100 job=node mountpoint=/srv/swift-storage/sdd1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2043&var-datasource=codfw%2520prometheus%252Fops [22:08:28] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 46.67% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:19:53] quite [22:20:09] Quit [22:20:16] :) [22:20:24] /quit [22:26:31] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10kaldari) @Imarlier - Have y'all tried the "Request Indexing" feature in Goog... [22:32:47] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
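The recurring ripe-atlas-eqiad IPv6 checks above flip between OK at 17-19 failed probes and CRITICAL at 20-21, so "alerts on 19" appears to mean critical once more than 19 of the 319 probes fail. A minimal sketch of that threshold logic under that assumption; the function name and output format are illustrative, not the actual check's code:

```python
def atlas_ping_status(failed, total=319, alert_threshold=19):
    """Mimic the PROBLEM/RECOVERY flip of the ripe-atlas-eqiad check:
    17-19 failed probes report OK, 20-21 report CRITICAL."""
    state = "CRITICAL" if failed > alert_threshold else "OK"
    return (f"{state} - failed {failed} probes of {total} "
            f"(alerts on {alert_threshold})")

if __name__ == "__main__":
    for failed in (17, 19, 20, 21):
        print(atlas_ping_status(failed))
```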