[00:00:00] woo
[00:00:02] no_justification ^^
[00:00:04] it worked
[00:00:04] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:01:36] oh https://gerrit.wikimedia.org/r/#/q/is:wip that was a lot of drafts :)
[00:02:43] RECOVERY - puppet last run on db2095 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:03:03] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[00:03:13] RECOVERY - puppet last run on labsdb1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:03:33] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:03:53] RECOVERY - puppet last run on releases1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:04:23] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[00:04:54] RECOVERY - puppet last run on releases2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[00:05:13] RECOVERY - puppet last run on db1124 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[00:05:14] RECOVERY - puppet last run on labsdb1011 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[00:05:23] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[00:08:13] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[00:08:43] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[00:08:44] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[00:10:33] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[00:10:33] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[00:10:33] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:11:43] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[00:13:43] RECOVERY - puppet last run on webperf2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:14:14] RECOVERY - puppet last run on db1125 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:15:35] (03PS3) 10Paladox: Add icinga2 [puppet] - 10https://gerrit.wikimedia.org/r/351540
[00:15:54] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[00:16:03] (that was a private comment, we should probably get wikibugs to ignore WIP changes)
[00:16:17] comment = commit (i.e. a draft that was converted to WIP)
[00:16:56] doesn't wikibugs use a public feed?
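(The filtering idea raised above — having wikibugs skip work-in-progress changes it sees on Gerrit's public event feed — could look roughly like this. A minimal sketch, assuming the bot consumes Gerrit 2.15's stream-events JSON, which exposes `wip` and `private` booleans on a change; the function and its name are illustrative, not wikibugs' actual code.)

```python
import json

def should_announce(event_line: str) -> bool:
    """Return True if a stream-events line should be relayed to IRC."""
    event = json.loads(event_line)
    if event.get("type") != "patchset-created":
        return False
    change = event.get("change", {})
    # Gerrit 2.15 marks work-in-progress and private changes explicitly;
    # skipping them avoids spamming channels with unfinished work.
    if change.get("wip") or change.get("private"):
        return False
    return True
```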
[00:17:04] RECOVERY - puppet last run on webperf1001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[00:17:12] yep, but i mean ignore the WIP changes
[00:17:29] since there could be many updates to WIP changes, which could spam channels
[00:18:44] There's not suddenly going to be many more commits just because there's a new feature for WIP
[00:18:48] 10Operations: Deactivate Chad's Racktables account - https://phabricator.wikimedia.org/T196787#4268451 (10demon) p:05Triage>03Normal
[00:21:09] Reedy: nope, but it could be quite annoying if you work on something and it spams the channel.
[00:21:25] That happens now?
[00:21:37] if you have open changes
[00:21:42] but the workaround was to use a draft
[00:22:13] RECOVERY - puppet last run on db1102 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:25:08] Few people actually used drafts for more than the initial staging of creating a change via the web
[00:25:29] i used it as i got told off in here for creating spammy changes.
[00:25:43] RECOVERY - puppet last run on kafka1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[00:39:07] (03PS4) 10Alex Monk: Allow PuppetDB use on standalone puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/435631 (https://phabricator.wikimedia.org/T194962)
[00:39:23] PROBLEM - SSH cp3042.mgmt on cp3042.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:39:23] RECOVERY - SSH cp3042.mgmt on cp3042.mgmt is OK: SSH OK - OpenSSH_5.8 (protocol 2.0)
[02:05:16] (03CR) 10Alex Monk: "Obsoleted by Ia65009dc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387570 (https://phabricator.wikimedia.org/T179371) (owner: 10Filippo Giunchedi)
[02:06:24] (03CR) 10Alex Monk: "Obsoleted by I411fcef3" [puppet] - 10https://gerrit.wikimedia.org/r/387579 (https://phabricator.wikimedia.org/T179371) (owner: 10Filippo Giunchedi)
[02:07:27] (03CR) 10Alex Monk: "Obsoleted by I411fcef3 ?" [puppet] - 10https://gerrit.wikimedia.org/r/386869 (https://phabricator.wikimedia.org/T179371) (owner: 10Filippo Giunchedi)
[02:12:23] 10Operations, 10Beta-Cluster-Infrastructure, 10Patch-For-Review, 10Prometheus-metrics-monitoring, 10User-fgiunchedi: Move deployment-prep redis instances to stretch - https://phabricator.wikimedia.org/T179371#4268622 (10Krenair) It looks like @joe has made and merged patches that essentially obsolete tho...
[03:23:57] (03PS5) 10Sau226: Implementing Patroller User Rights for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437777 (https://phabricator.wikimedia.org/T196488)
[03:24:33] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 785.17 seconds
[04:22:32] PROBLEM - Device not healthy -SMART- on db1065 is CRITICAL: cluster=mysql device=megaraid,1 instance=db1065:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1065&var-datasource=eqiad%2520prometheus%252Fops
[04:41:53] RECOVERY - Device not healthy -SMART- on rdb1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=rdb1004&var-datasource=eqiad%2520prometheus%252Fops
[05:26:52] PROBLEM - Device not healthy -SMART- on db1063 is CRITICAL: cluster=mysql device=megaraid,3 instance=db1063:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1063&var-datasource=eqiad%2520prometheus%252Fops
[06:04:23] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 272.42 seconds
[06:17:53] RECOVERY - MariaDB Slave Lag: s1 on db2070 is OK: OK slave_sql_lag Replication lag: 0.44 seconds
[06:22:52] RECOVERY - MariaDB Slave Lag: s1 on db2055 is OK: OK slave_sql_lag Replication lag: 0.24 seconds
[06:28:33] PROBLEM - puppet last run on ganeti1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled]
[06:29:32] PROBLEM - puppet last run on analytics1071 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/10-puppet-agent.conf]
[06:30:33] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check_conntrack]
[06:56:03] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:03] RECOVERY - puppet last run on ganeti1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:00:03] RECOVERY - puppet last run on analytics1071 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:43:42] RECOVERY - MegaRAID on labstore1003 is OK: OK: optimal, 5 logical, 34 physical
[07:50:46] (03PS1) 10Urbanecm: Regenerate logo for bnwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439394 (https://phabricator.wikimedia.org/T196803)
[07:53:09] (03PS5) 10Urbanecm: id_privatewikimedia: Initial configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/438279
[08:02:33] 10Operations, 10ops-eqiad, 10DBA: Bad disk on db1063 - https://phabricator.wikimedia.org/T196804#4268877 (10Marostegui)
[08:02:44] 10Operations, 10ops-eqiad, 10DBA: Bad disk on db1063 - https://phabricator.wikimedia.org/T196804#4268889 (10Marostegui) p:05Triage>03Normal
[08:04:42] 10Operations, 10ops-eqiad, 10DBA: Bad disk on db1065 - https://phabricator.wikimedia.org/T196806#4268905 (10Marostegui)
[08:04:54] 10Operations, 10ops-eqiad, 10DBA: Bad disk on db1065 - https://phabricator.wikimedia.org/T196806#4268917 (10Marostegui) p:05Triage>03Normal
[08:05:26] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1065 is CRITICAL: cluster=mysql device=megaraid,1 instance=db1065:9100 job=node site=eqiad Marostegui T196806 - The acknowledgement expires at: 2018-06-14 08:05:10. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1065&var-datasource=eqiad%2520prometheus%252Fops
[08:05:57] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1063 is CRITICAL: cluster=mysql device=megaraid,3 instance=db1063:9100 job=node site=eqiad Marostegui T196804 - The acknowledgement expires at: 2018-06-14 08:05:46. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1063&var-datasource=eqiad%2520prometheus%252Fops
[08:31:43] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1972 bytes in 0.105 second response time
[08:35:41] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: Traceback (most recent call last)
[08:35:41] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: Traceback (most recent call last)
[08:36:42] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1943 bytes in 0.079 second response time
[08:41:01] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 8 probes of 302 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[08:41:01] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 1 probes of 322 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map
[09:01:08] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1961 bytes in 0.077 second response time
[09:15:37] PROBLEM - MariaDB Slave Lag: s7 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 360.65 seconds
[09:16:17] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1971 bytes in 0.086 second response time
[09:18:58] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 468.72 seconds
[09:28:28] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1965 bytes in 0.086 second response time
[09:33:28] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1950 bytes in 0.089 second response time
[09:40:37] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1966 bytes in 0.080 second response time
[10:05:58] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1949 bytes in 0.071 second response time
[10:18:27] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1968 bytes in 0.069 second response time
[10:21:57] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:28:49] (03CR) 10Volans: "Thanks a lot for this initial version, this is great! I've added some general question/proposal inline." (033 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff)
[10:52:08] (03CR) 10Chad: "I'd like to run under dual mode for awhile, to be safe." [puppet] - 10https://gerrit.wikimedia.org/r/408298 (https://phabricator.wikimedia.org/T174034) (owner: 10Paladox)
[10:54:07] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1969 bytes in 0.111 second response time
[11:26:57] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1965 bytes in 0.076 second response time
[11:36:58] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1970 bytes in 0.096 second response time
[12:10:12] (03PS3) 10Urbanecm: id_privatewikimedia: register in DNS [dns] - 10https://gerrit.wikimedia.org/r/438275 (https://phabricator.wikimedia.org/T196747)
[12:10:16] (03PS3) 10Urbanecm: id_privatewikimedia: add Apache configuration [puppet] - 10https://gerrit.wikimedia.org/r/438276 (https://phabricator.wikimedia.org/T196747)
[12:12:20] (03PS1) 10Aklapper: Create a FeaturedFeed for the News on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439436 (https://phabricator.wikimedia.org/T165773)
[12:13:16] (03CR) 10Aklapper: "I have no idea what I'm doing here. See https://phabricator.wikimedia.org/T165773#4267238 etc." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439436 (https://phabricator.wikimedia.org/T165773) (owner: 10Aklapper)
[12:19:12] (03CR) 10Reedy: [C: 04-1] Create a FeaturedFeed for the News on mediawikiwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439436 (https://phabricator.wikimedia.org/T165773) (owner: 10Aklapper)
[12:44:24] (03PS2) 10Aklapper: Create a FeaturedFeed for the News on mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439436 (https://phabricator.wikimedia.org/T165773)
[12:48:38] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational
[12:48:44] (03PS1) 10Reedy: Move if onto newline in FeaturedFeedsWMF.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439438
[12:49:15] (03PS2) 10Reedy: Move if onto newline in FeaturedFeedsWMF.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439438
[12:50:32] (03CR) 10Reedy: Move if onto newline in FeaturedFeedsWMF.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439438 (owner: 10Reedy)
[12:51:57] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[13:08:49] 10Operations, 10Gerrit, 10Patch-For-Review, 10Release-Engineering-Team (Someday): Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#4269142 (10Paladox) 05stalled>03Resolved We are now on 2.15 and just tested, and emojis work now!
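(The "MariaDB Slave Lag" alerts above report a slave_sql_lag figure against warning/critical thresholds. A minimal sketch of a probe with that shape, assuming thresholds around the ~300s visible in the log; WMF's production check derives lag from pt-heartbeat, while this sketch reads Seconds_Behind_Master for simplicity, and the connection details are illustrative.)

```python
import sys
import pymysql

WARNING, CRITICAL = 180.0, 300.0  # thresholds in seconds (illustrative)

def check_slave_lag(host: str) -> int:
    conn = pymysql.connect(host=host, read_default_file="~/.my.cnf",
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
    finally:
        conn.close()
    # Seconds_Behind_Master is None when the SQL thread is not running.
    lag = status["Seconds_Behind_Master"] if status else None
    if lag is None:
        print("CRITICAL slave_sql_lag Replication not running")
        return 2
    if lag >= CRITICAL:
        print(f"CRITICAL slave_sql_lag Replication lag: {lag:.2f} seconds")
        return 2
    if lag >= WARNING:
        print(f"WARNING slave_sql_lag Replication lag: {lag:.2f} seconds")
        return 1
    print(f"OK slave_sql_lag Replication lag: {lag:.2f} seconds")
    return 0

if __name__ == "__main__":
    sys.exit(check_slave_lag(sys.argv[1]))
```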
[13:09:39] (03PS4) 10Daimona Eaytoy: Enable $wgAbuseFilterProfile on every wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423660 (https://phabricator.wikimedia.org/T191039)
[13:10:07] (03PS2) 10Daimona Eaytoy: Enable $wgAbuseFilterRuntimeProfile on every wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423945 (https://phabricator.wikimedia.org/T191039)
[14:39:28] (03PS1) 10Paladox: Gerrit: Make PolyGerrit the default ui [puppet] - 10https://gerrit.wikimedia.org/r/439444
[14:43:38] PROBLEM - MariaDB Slave Lag: s3 on db2074 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.19 seconds
[14:43:47] PROBLEM - MariaDB Slave Lag: s3 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.87 seconds
[14:43:58] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.72 seconds
[14:44:07] PROBLEM - MariaDB Slave Lag: s3 on db2043 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.71 seconds
[14:44:08] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 312.77 seconds
[14:44:08] PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.36 seconds
[14:47:18] (03PS2) 10Paladox: Gerrit: Make PolyGerrit the default ui [puppet] - 10https://gerrit.wikimedia.org/r/439444 (https://phabricator.wikimedia.org/T196812)
[14:49:00] (03CR) 10Paladox: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/439444 (https://phabricator.wikimedia.org/T196812) (owner: 10Paladox)
[14:49:17] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational
[14:49:27] that is the WIP button, i didn't write that :)
[14:49:36] it changes from WIP to ready for review now.
[14:52:28] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
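(The "Check systemd state" alerts that keep flapping on kubernetes2003 map directly onto systemd's own health summary. A minimal sketch of such a check, assuming it simply wraps `systemctl is-system-running`; the real check lives in WMF's puppet tree and may differ.)

```python
import subprocess
import sys

def main() -> int:
    # `systemctl is-system-running` prints a one-word state such as
    # "running", "degraded" or "maintenance"; exit code is ignored here.
    state = subprocess.run(["systemctl", "is-system-running"],
                           capture_output=True, text=True).stdout.strip()
    if state == "running":
        print("OK - running: The system is fully operational")
        return 0
    if state == "degraded":
        print("CRITICAL - degraded: The system is operational but one or more units failed.")
        return 2
    print(f"UNKNOWN - systemd reports state: {state}")
    return 3

if __name__ == "__main__":
    sys.exit(main())
```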
[15:12:07] PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 373.82 seconds
[15:19:47] RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 32.90 seconds
[16:04:17] RECOVERY - MariaDB Slave Lag: s3 on db2074 is OK: OK slave_sql_lag Replication lag: 12.22 seconds
[16:04:38] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 0.26 seconds
[16:04:38] RECOVERY - MariaDB Slave Lag: s3 on db2043 is OK: OK slave_sql_lag Replication lag: 0.04 seconds
[16:04:47] RECOVERY - MariaDB Slave Lag: s3 on db2036 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[16:04:48] RECOVERY - MariaDB Slave Lag: s3 on db2050 is OK: OK slave_sql_lag Replication lag: 0.09 seconds
[16:05:28] RECOVERY - MariaDB Slave Lag: s3 on db2094 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[16:14:08] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1971 bytes in 0.064 second response time
[16:24:27] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1974 bytes in 0.090 second response time
[16:36:47] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1972 bytes in 0.092 second response time
[17:02:08] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1947 bytes in 0.079 second response time
[17:09:07] PROBLEM - BGP status on cr2-knams is CRITICAL: BGP CRITICAL - AS13030/IPv6: OpenConfirm, AS13030/IPv4: OpenConfirm
[17:20:08] RECOVERY - BGP status on cr2-knams is OK: BGP OK - up: 11, down: 0, shutdown: 0
[17:23:28] PROBLEM - BGP status on cr2-knams is CRITICAL: BGP CRITICAL - AS13030/IPv4: OpenConfirm
[17:25:38] RECOVERY - BGP status on cr2-knams is OK: BGP OK - up: 11, down: 0, shutdown: 0
[17:32:27] PROBLEM - BGP status on cr2-knams is CRITICAL: BGP CRITICAL - AS13030/IPv6: Active
[17:33:28] RECOVERY - BGP status on cr2-knams is OK: BGP OK - up: 11, down: 0, shutdown: 0
[17:36:48] PROBLEM - BGP status on cr2-knams is CRITICAL: BGP CRITICAL - AS13030/IPv6: OpenConfirm, AS13030/IPv4: OpenConfirm
[17:46:38] RECOVERY - BGP status on cr2-knams is OK: BGP OK - up: 11, down: 0, shutdown: 0
[17:49:17] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational
[17:49:58] PROBLEM - BGP status on cr2-knams is CRITICAL: BGP CRITICAL - AS13030/IPv6: OpenConfirm, AS13030/IPv4: Active
[17:52:17] RECOVERY - BGP status on cr2-knams is OK: BGP OK - up: 11, down: 0, shutdown: 0
[17:52:37] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
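(The "BGP status" alerts flapping on cr2-knams report sessions stuck in non-Established FSM states such as Active and OpenConfirm. A sketch of the underlying logic only: any peer not in Established is critical. How the states are fetched — SNMP, NETCONF — is elided; `evaluate` and the peer-map shape are hypothetical.)

```python
# RFC 4271 BGP finite-state-machine states, as numbered in the standard MIB.
BGP_STATES = {1: "Idle", 2: "Connect", 3: "Active",
              4: "OpenSent", 5: "OpenConfirm", 6: "Established"}

def evaluate(peers):
    """peers maps 'ASxxxxx/IPvY' labels to a BGP FSM state code.
    Returns (nagios_exit_code, status_line)."""
    down = {name: BGP_STATES[state]
            for name, state in peers.items() if state != 6}
    if down:
        detail = ", ".join(f"{name}: {st}" for name, st in sorted(down.items()))
        return 2, f"BGP CRITICAL - {detail}"
    return 0, f"BGP OK - up: {len(peers)}, down: 0, shutdown: 0"

# Example: a peer lingering in OpenConfirm trips the check, as in the log.
print(evaluate({"AS13030/IPv4": 5, "AS13030/IPv6": 6})[1])
```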
[18:07:47] PROBLEM - BGP status on cr2-knams is CRITICAL: BGP CRITICAL - AS13030/IPv4: OpenConfirm, AS13030/IPv6: OpenConfirm
[18:14:28] RECOVERY - BGP status on cr2-knams is OK: BGP OK - up: 11, down: 0, shutdown: 0
[18:21:08] PROBLEM - BGP status on cr2-knams is CRITICAL: BGP CRITICAL - AS13030/IPv4: OpenConfirm, AS13030/IPv6: Active
[18:23:27] RECOVERY - BGP status on cr2-knams is OK: BGP OK - up: 11, down: 0, shutdown: 0
[18:26:47] PROBLEM - BGP status on cr2-knams is CRITICAL: BGP CRITICAL - AS13030/IPv4: OpenConfirm, AS13030/IPv6: OpenConfirm
[18:34:37] RECOVERY - BGP status on cr2-knams is OK: BGP OK - up: 11, down: 0, shutdown: 0
[18:56:48] PROBLEM - BGP status on cr2-knams is CRITICAL: BGP CRITICAL - AS13030/IPv6: OpenConfirm, AS13030/IPv4: OpenConfirm
[18:57:57] RECOVERY - BGP status on cr2-knams is OK: BGP OK - up: 11, down: 0, shutdown: 0
[19:03:27] PROBLEM - BGP status on cr2-knams is CRITICAL: BGP CRITICAL - AS13030/IPv6: OpenConfirm, AS13030/IPv4: OpenConfirm
[19:13:18] RECOVERY - BGP status on cr2-knams is OK: BGP OK - up: 11, down: 0, shutdown: 0
[20:12:38] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1974 bytes in 0.086 second response time
[20:19:07] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational
[20:22:27] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[20:22:48] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1966 bytes in 0.074 second response time
[20:32:18] PROBLEM - MariaDB Slave Lag: s6 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 337.49 seconds
[20:32:27] PROBLEM - MariaDB Slave Lag: s6 on db2046 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 340.94 seconds
[20:32:28] PROBLEM - MariaDB Slave Lag: s6 on db2039 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 341.68 seconds
[20:32:37] PROBLEM - MariaDB Slave Lag: s6 on db2067 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 345.83 seconds
[20:32:57] PROBLEM - MariaDB Slave Lag: s6 on db2087 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 354.63 seconds
[20:33:07] PROBLEM - MariaDB Slave Lag: s6 on db2060 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 359.20 seconds
[20:33:07] PROBLEM - MariaDB Slave Lag: s6 on db2089 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 359.90 seconds
[20:33:08] PROBLEM - MariaDB Slave Lag: s6 on db2053 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 360.17 seconds
[20:33:08] PROBLEM - MariaDB Slave Lag: s6 on db2076 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 360.11 seconds
[20:45:18] PROBLEM - Memory correctable errors -EDAC- on scb1002 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=scb1002&var-datasource=eqiad%2520prometheus%252Fops
[21:38:34] (03CR) 10Alex Monk: "... Well this is a dependency regardless." [puppet] - 10https://gerrit.wikimedia.org/r/372764 (owner: 10Alex Monk)
[21:43:39] (03Abandoned) 10Alex Monk: Fix mwrepl to require expanddblist dependency, from scap::scripts [puppet] - 10https://gerrit.wikimedia.org/r/372764 (owner: 10Alex Monk)
[21:45:28] (03CR) 10Alex Monk: "Removed this commit from the deployment-prep puppetmasters." [puppet] - 10https://gerrit.wikimedia.org/r/372764 (owner: 10Alex Monk)
[21:54:42] (03PS6) 10Alex Monk: Move some production apache config files to templates [puppet] - 10https://gerrit.wikimedia.org/r/322602 (https://phabricator.wikimedia.org/T1256)
[22:00:59] (03CR) 10Alex Monk: "This is the next step in killing a lot of beta-prod divergence and it's trivial. Can someone please review it? Remember Puppet SWAT is not" [puppet] - 10https://gerrit.wikimedia.org/r/322602 (https://phabricator.wikimedia.org/T1256) (owner: 10Alex Monk)
[22:26:13] (03Abandoned) 10Alex Monk: Try to separate trebuchet stuff from role::deployment::server [puppet] - 10https://gerrit.wikimedia.org/r/284851 (owner: 10Alex Monk)
[22:28:52] (03Abandoned) 10Alex Monk: Get rid of mw-deployment-vars.sh [puppet] - 10https://gerrit.wikimedia.org/r/316928 (owner: 10Alex Monk)
[22:29:14] (03Abandoned) 10Alex Monk: [WIP] Move from ircecho to tcpircbot [puppet] - 10https://gerrit.wikimedia.org/r/240945 (owner: 10Alex Monk)
[22:29:38] (03Abandoned) 10Alex Monk: tcpircbot: Allow per-infile channel lists [puppet] - 10https://gerrit.wikimedia.org/r/240939 (owner: 10Alex Monk)
[22:30:34] (03CR) 10Alex Monk: [C: 04-1] ""we don't bother rewriting legacy URLs" - breaks existing URLs" [puppet] - 10https://gerrit.wikimedia.org/r/422571 (owner: 10Chad)
[22:31:03] (03CR) 10Alex Monk: [C: 04-1] "Yeah except that one breaks legacy URLs." [puppet] - 10https://gerrit.wikimedia.org/r/322425 (owner: 10Alex Monk)
[22:34:28] (03CR) 10Alex Monk: "I gave up with trying to make LVS in beta due to labs networking security restrictions - the Neutron upgrade should make it doable (hopefu" [puppet] - 10https://gerrit.wikimedia.org/r/316512 (owner: 10Alex Monk)
[22:34:53] (03PS2) 10Alex Monk: deployment-prep: Make LVS config compatible with new requirements [puppet] - 10https://gerrit.wikimedia.org/r/316512 (https://phabricator.wikimedia.org/T196662)
[22:50:36] (03Abandoned) 10Paladox: Replace TEMPLATE_CONTEXT_PROCESSORS with TEMPLATES [software/servermon] - 10https://gerrit.wikimedia.org/r/362600 (owner: 10Paladox)
[22:50:45] (03PS3) 10Alex Monk: keystone: Create top-level domain for each new project [puppet] - 10https://gerrit.wikimedia.org/r/375089 (https://phabricator.wikimedia.org/T162977)
[22:55:26] (03PS2) 10Alex Monk: shinkengen for all projects [puppet] - 10https://gerrit.wikimedia.org/r/374897 (https://phabricator.wikimedia.org/T166845)
[22:56:09] (03CR) 10jerkins-bot: [V: 04-1] shinkengen for all projects [puppet] - 10https://gerrit.wikimedia.org/r/374897 (https://phabricator.wikimedia.org/T166845) (owner: 10Alex Monk)
[23:00:26] (03PS3) 10Alex Monk: shinkengen for all projects [puppet] - 10https://gerrit.wikimedia.org/r/374897 (https://phabricator.wikimedia.org/T166845)
[23:09:47] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - api-https_443: Servers mw1341.eqiad.wmnet are marked down but pooled
[23:10:48] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy
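(The final alert, "Servers mw1341.eqiad.wmnet are marked down but pooled", comes from PyBal's depool-threshold behaviour: a backend that fails its health checks is still kept in the pool if depooling it would leave too little capacity. The sketch below mirrors that behaviour only in spirit; the function name, data shapes and 50% threshold are invented for illustration, not PyBal's actual internals.)

```python
def pool_state(servers, depool_threshold=0.5):
    """servers: list of dicts like {"host": str, "up": bool}.
    Returns (pooled_hosts, down_but_pooled_hosts)."""
    total = len(servers)
    pooled = [s["host"] for s in servers if s["up"]]
    down_but_pooled = []
    for s in servers:
        if s["up"]:
            continue
        # Depooling this failed server would drop the pool below the
        # threshold: keep it pooled anyway and let monitoring complain,
        # which is exactly what the alert above reports.
        if len(pooled) < total * depool_threshold:
            pooled.append(s["host"])
            down_but_pooled.append(s["host"])
    return pooled, down_but_pooled

# Example: with three backends and one failure the server is depooled
# normally; with two failures, one of them stays "down but pooled".
print(pool_state([{"host": "mw1341", "up": False},
                  {"host": "mw1342", "up": False},
                  {"host": "mw1343", "up": True}]))
```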