[00:25:57] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[00:27:58] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:34:18] RECOVERY - MariaDB Slave Lag: s7 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[03:26:48] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 847.89 seconds
[03:57:03] Operations, Beta-Cluster-Infrastructure: Mails through deployment-mx SPF & DKIM fails - https://phabricator.wikimedia.org/T87338#4269749 (Krenair) Probably, I should probably add an SPF record allowing this host to send mail
[04:08:35] Operations, Beta-Cluster-Infrastructure: Mails through deployment-mx SPF & DKIM fails - https://phabricator.wikimedia.org/T87338#4269753 (Krenair) I've added SPF and DMARC (p=none) records. Haven't done DKIM yet.
[04:11:18] PROBLEM - MariaDB Slave Lag: s3 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 404.09 seconds
[04:13:28] PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 434.86 seconds
[04:13:58] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 102.83 seconds
[04:19:57] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.79 seconds
[04:19:57] PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.42 seconds
[04:20:07] RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 48.94 seconds
[04:20:08] PROBLEM - MariaDB Slave Lag: s3 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 304.80 seconds
[04:20:28] PROBLEM - MariaDB Slave Lag: s3 on db2074 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 309.82 seconds
[04:20:57] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 313.77 seconds
[04:20:58] PROBLEM - MariaDB Slave Lag: s3 on db2043 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 315.15 seconds
[04:48:47] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational
[04:52:07] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:12:58] PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 443.26 seconds
[05:18:37] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational
[05:19:37] RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 35.17 seconds
[05:21:57] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:44:18] PROBLEM - MariaDB Slave Lag: s7 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 382.05 seconds
[05:44:47] PROBLEM - MariaDB Slave Lag: s7 on db2061 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 393.64 seconds
[06:22:28] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 0.46 seconds
[06:22:37] RECOVERY - MariaDB Slave Lag: s3 on db2043 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[06:22:37] RECOVERY - MariaDB Slave Lag: s3 on db2036 is OK: OK slave_sql_lag Replication lag: 0.33 seconds
[06:22:37] RECOVERY - MariaDB Slave Lag: s3 on db2050 is OK: OK slave_sql_lag Replication lag: 0.36 seconds
[06:22:47] RECOVERY - MariaDB Slave Lag: s3 on db2094 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[06:23:08] RECOVERY - MariaDB Slave Lag: s3 on db2074 is OK: OK slave_sql_lag Replication lag: 0.41 seconds
[06:28:19] (PS1) Alex Monk: Followup If545182a: Actually use cert_name now [puppet] - https://gerrit.wikimedia.org/r/439451
[06:30:17] PROBLEM - puppet last run on mw1308 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean]
[06:30:47] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/add-ldap-group],File[/etc/update-motd.d/97-last-puppet-run]
[06:45:17] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet operation_type={create_container,image_status,podsandbox_status,remove_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[06:46:18] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[06:55:38] RECOVERY - puppet last run on mw1308 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:56:08] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:57:35] (PS1) Dzahn: wikistats: update name of miraheze import script [puppet] - https://gerrit.wikimedia.org/r/439455 (https://phabricator.wikimedia.org/T191245)
[07:00:39] (PS2) Dzahn: wikistats: update name of miraheze import script [puppet] - https://gerrit.wikimedia.org/r/439455 (https://phabricator.wikimedia.org/T191245)
[07:01:54] (CR) Dzahn: [C: 2] wikistats: update name of miraheze import script [puppet] - https://gerrit.wikimedia.org/r/439455 (https://phabricator.wikimedia.org/T191245) (owner: Dzahn)
[07:03:17] PROBLEM - MariaDB Slave Lag: s3 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 302.23 seconds
[07:08:27] PROBLEM - MariaDB Slave Lag: s3 on db2057 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.30 seconds
[07:08:28] PROBLEM - MariaDB Slave Lag: s3 on db2043 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 300.56 seconds
[07:08:37] PROBLEM - MariaDB Slave Lag: s3 on db2036 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.67 seconds
[07:08:37] PROBLEM - MariaDB Slave Lag: s3 on db2050 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.90 seconds
[07:09:07] PROBLEM - MariaDB Slave Lag: s3 on db2074 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.27 seconds
[07:12:38] PROBLEM - MariaDB Slave Lag: s3 on db1124 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 406.74 seconds
[07:18:37] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational
[07:20:18] RECOVERY - MariaDB Slave Lag: s3 on db1124 is OK: OK slave_sql_lag Replication lag: 0.33 seconds
[07:21:57] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:29:47] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 22 probes of 301 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map
[07:32:46] (PS1) Urbanecm: Fix wgMetaNamespace for pswikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/439457 (https://phabricator.wikimedia.org/T196837)
[07:34:57] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 8 probes of 301 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map
[07:40:45] (PS4) Urbanecm: id_internalwikimedia: register in DNS [dns] - https://gerrit.wikimedia.org/r/438275 (https://phabricator.wikimedia.org/T196747)
[07:43:41] (PS6) Urbanecm: id_internalwikimedia: Initial configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/438279
[07:45:43] (PS4) Urbanecm: id_privatewikimedia: add Apache configuration [puppet] - https://gerrit.wikimedia.org/r/438276 (https://phabricator.wikimedia.org/T196747)
[07:46:06] (PS5) Urbanecm: id_internalwikimedia: add Apache configuration [puppet] - https://gerrit.wikimedia.org/r/438276 (https://phabricator.wikimedia.org/T196747)
[07:48:29] (PS7) Urbanecm: id_internalwikimedia: Initial configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/438279
[08:01:08] RECOVERY - MariaDB Slave Lag: s3 on db2074 is OK: OK slave_sql_lag Replication lag: 23.28 seconds
[08:01:38] RECOVERY - MariaDB Slave Lag: s3 on db2057 is OK: OK slave_sql_lag Replication lag: 0.06 seconds
[08:01:47] RECOVERY - MariaDB Slave Lag: s3 on db2043 is OK: OK slave_sql_lag Replication lag: 0.04 seconds
[08:01:48] RECOVERY - MariaDB Slave Lag: s3 on db2036 is OK: OK slave_sql_lag Replication lag: 0.45 seconds
[08:01:48] RECOVERY - MariaDB Slave Lag: s3 on db2050 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[08:01:57] RECOVERY - MariaDB Slave Lag: s3 on db2094 is OK: OK slave_sql_lag Replication lag: 0.56 seconds
[08:02:38] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[08:03:18] RECOVERY - MariaDB Slave Lag: s7 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 37.95 seconds
[08:27:58] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1969 bytes in 0.092 second response time
[08:48:18] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1971 bytes in 0.074 second response time
[08:54:47] (PS4) EddieGP: mediawiki: Move www.wikimedia.org to wwwportals.conf [puppet] - https://gerrit.wikimedia.org/r/424707 (https://phabricator.wikimedia.org/T173887)
[08:55:20] (CR) EddieGP: "bump - joe, when will you have time to review this?" [puppet] - https://gerrit.wikimedia.org/r/424707 (https://phabricator.wikimedia.org/T173887) (owner: EddieGP)
[09:01:37] (CR) EddieGP: [C: -1] "We will eventually want this, but currently deployment-tin.eqiad.wmflabs is still alive (I just logged in there), definitely not "has been" [puppet] - https://gerrit.wikimedia.org/r/438001 (https://phabricator.wikimedia.org/T192071) (owner: Dzahn)
[09:36:18] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1976 bytes in 0.070 second response time
[09:41:27] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1966 bytes in 0.106 second response time
[09:48:47] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1963 bytes in 0.079 second response time
[09:49:07] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational
[09:52:18] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:53:48] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1966 bytes in 0.068 second response time
[10:12:38] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[10:26:57] RECOVERY - MariaDB Slave Lag: s6 on db2053 is OK: OK slave_sql_lag Replication lag: 20.66 seconds
[10:26:57] RECOVERY - MariaDB Slave Lag: s6 on db2060 is OK: OK slave_sql_lag Replication lag: 4.46 seconds
[10:27:18] RECOVERY - MariaDB Slave Lag: s6 on db2046 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[10:27:28] RECOVERY - MariaDB Slave Lag: s6 on db2039 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[10:27:37] RECOVERY - MariaDB Slave Lag: s6 on db2067 is OK: OK slave_sql_lag Replication lag: 0.51 seconds
[10:27:37] RECOVERY - MariaDB Slave Lag: s6 on db2087 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[10:27:47] RECOVERY - MariaDB Slave Lag: s6 on db2076 is OK: OK slave_sql_lag Replication lag: 0.46 seconds
[10:27:48] RECOVERY - MariaDB Slave Lag: s6 on db2089 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[10:39:18] RECOVERY - MariaDB Slave Lag: s6 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 0.24 seconds
[10:42:10] Operations, DBA, Gerrit, Phabricator: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840#4269936 (Marostegui)
[10:43:17] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[10:45:29] Operations, DBA, Gerrit, Phabricator: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840#4269948 (Marostegui) p:Triage>High
[10:45:36] Operations, DBA, Gerrit, Phabricator: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840#4269949 (greg) Related: The Gerrit upgrade included a migration that created many new git refs. Those are replicated to Phabricator and thus it also had to ingest/index them.
[10:46:23] Operations, DBA, Gerrit, Phabricator: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840#4269950 (Marostegui) Ah right! I'm from my phone and cannot check what the writes are. Any ETA for that to be finished?
[10:47:07] Operations, DBA, Gerrit, Phabricator: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840#4269951 (greg) Not sure, @mmodell ? @demon ?
[10:53:05] Operations, DBA, Gerrit, Phabricator: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840#4269954 (Marostegui) Codfw is lagging behind as it cannot cope with the amount of writes. Not a big deal as it is not used, but it is an indicative of how massive it is. It w...
[11:03:12] Gerrit down?
[11:04:38] Operations, DBA, Gerrit, Phabricator: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840#4269955 (Paladox) It will be a long while as phab has to parse all the new commits (notedb) we should probaly try to ignore refs/changes/**/meta in phabricator.
[11:05:17] marostegui: it’s not loading for me
[11:07:41] https://downforeveryoneorjustme.com/gerrit.wikimedia.org
[11:08:19] <_joe_> only the web interface
[11:08:21] <_joe_> I'm on it
[11:08:48] <_joe_> !log restarting gerrit on cobalt as the web interface is unresponsive
[11:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:31] <_joe_> marostegui_: do you need gerrit right now?
[11:10:29] Nope
[11:10:48] As I created a task related to it, I saw there was an alert about it
[11:11:09] But as I am on 4G I wasn't sure if it was me only or the alert was indeed correct
[11:11:37] <_joe_> gerrit now loads
[11:11:49] <_joe_> but it's spitting scary errors
[11:11:57] Works for me indeed
[11:12:04] <_joe_> marostegui_: log into your nick
[11:12:15] Thanks
[11:12:23] <_joe_> paladox: I don't think we're ok
[11:12:24] _joe_: what kind of errors?
[11:12:31] <_joe_> [2018-06-10 11:11:52,155] [OnlineNoteDbMigrator] ERROR com.google.gerrit.server.notedb.rebuild.NoteDbMigrator : Error migrating primary storage for 35785
[11:12:38] That’s ok
[11:12:43] <_joe_> that's ok?
[11:12:46] Those changes were probaly deleted
[11:13:04] <_joe_> ok then, see you tomorrow, hopefully
[11:13:07] https://phabricator.wikimedia.org/T196840
[11:13:12] * _joe_ back to lunch
[11:13:12] For what it worth
[11:13:27] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 3 minutes ago with 7 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy]
[11:13:37] PROBLEM - puppet last run on vega is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 3 minutes ago with 5 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikibase/wikiba.se-deploy],Exec[git_pull_research/landing-page],Exec[git_pull_design/landing-page]
[11:13:38] PROBLEM - puppet last run on releases2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI Composer]
[11:13:41] <_joe_> this is related to the restart ^^
[11:13:43] <_joe_> I think
[11:13:48] PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/deployment-charts],Exec[git_pull_jenkins CI Composer]
[11:14:00] hi, just got online. can i still help
[11:14:08] got a text from greg
[11:14:19] <_joe_> mutante: I should've fixed the immediate issue
[11:14:28] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 4 minutes ago with 4 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki],Exec[git_pull_analytics/reportupdater],Exec[git_pull_wikimedia/discovery/golden]
[11:14:32] <_joe_> see the possible cause elsewhere, where I pasted it
[11:14:37] ok, thank you
[11:14:38] PROBLEM - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_geowiki-scripts],Exec[git_pull_analytics.wikimedia.org]
[11:14:38] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[11:14:40] <_joe_> the puppet errors will go away soon
[11:14:47] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[11:14:51] <_joe_> I just re-ran puppet on vega, and it's ok
[11:15:08] _joe_: it is being hard to log with my nick, my connection is quite bad now :(
[11:15:25] <_joe_> marostegui_: ok, go away :D
[11:15:30] <_joe_> I'm going off too
[11:15:39] Thanks for getting it fixed!
[11:15:57] Not sure if it could have been related to the task i created
[11:16:07] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[11:16:35] Or marostegui_ not related to your task :)
[11:17:28] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas]
[11:17:41] OK :)
[11:17:48] Operations, DBA, Gerrit, Phabricator: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840#4269958 (Marostegui) >>! In T196840#4269955, @Paladox wrote: > It will be a long while as phab has to parse all the new commits (notedb) we should probaly try to ignore refs/...
[11:18:38] RECOVERY - puppet last run on vega is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:18:58] Operations, DBA, Gerrit, Phabricator: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840#4269961 (Paladox) Yep days.
[11:29:48] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[11:33:58] RECOVERY - puppet last run on releases2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:34:08] RECOVERY - puppet last run on releases1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:36:37] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:37:48] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[11:38:47] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:40:08] RECOVERY - puppet last run on analytics1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:40:08] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[11:40:08] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:59:31] (PS1) MarcoAurelio: Archive the operations/software/tessera repository [software/tessera] - https://gerrit.wikimedia.org/r/439467 (https://phabricator.wikimedia.org/T186096)
[12:01:28] !log disable some botpasswords (T194204)
[12:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:42] (CR) MarcoAurelio: "No .gitreview and `git push origin HEAD:master` gives me permission error. I'll try to see what I can do." [software/tessera] - https://gerrit.wikimedia.org/r/439467 (https://phabricator.wikimedia.org/T186096) (owner: MarcoAurelio)
[12:06:07] Operations, DBA, Gerrit, Phabricator: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840#4270016 (demon) We haven't needed to replicate any refs other than heads and tags since we brought gitiles online... Disable them. Now. And prune them from Phab while we're...
[12:06:32] (PS1) MarcoAurelio: Temporary allow Gerrit Managers to own this repository for archiving purposes [software/tessera] (refs/meta/config) - https://gerrit.wikimedia.org/r/439468 (https://phabricator.wikimedia.org/T186096)
[12:13:56] (PS2) MarcoAurelio: Archive the operations/software/tessera repository [software/tessera] - https://gerrit.wikimedia.org/r/439467 (https://phabricator.wikimedia.org/T186096)
[12:14:32] (Abandoned) MarcoAurelio: Temporary allow Gerrit Managers to own this repository for archiving purposes [software/tessera] (refs/meta/config) - https://gerrit.wikimedia.org/r/439468 (https://phabricator.wikimedia.org/T186096) (owner: MarcoAurelio)
[12:15:24] (PS1) MarcoAurelio: Mark repository as read only [software/tessera] (refs/meta/config) - https://gerrit.wikimedia.org/r/439469
[12:16:01] (PS2) MarcoAurelio: Mark repository as read only [software/tessera] (refs/meta/config) - https://gerrit.wikimedia.org/r/439469
[12:16:33] (PS3) MarcoAurelio: Mark repository as read only [software/tessera] (refs/meta/config) - https://gerrit.wikimedia.org/r/439469 (https://phabricator.wikimedia.org/T186096)
[12:20:52] (PS1) Reedy: Enable wgCSPReportOnlyHeader for group0 [mediawiki-config] - https://gerrit.wikimedia.org/r/439470
[12:26:42] (CR) Reedy: [C: 2] Enable wgCSPReportOnlyHeader for group0 [mediawiki-config] - https://gerrit.wikimedia.org/r/439470 (owner: Reedy)
[12:28:21] (Merged) jenkins-bot: Enable wgCSPReportOnlyHeader for group0 [mediawiki-config] - https://gerrit.wikimedia.org/r/439470 (owner: Reedy)
[12:28:35] (CR) jenkins-bot: Enable wgCSPReportOnlyHeader for group0 [mediawiki-config] - https://gerrit.wikimedia.org/r/439470 (owner: Reedy)
[12:30:04] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: CSP in report mode for group0 (duration: 00m 55s)
[12:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:46] (PS1) Reedy: Enable CSP in Report Only Mode everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/439471
[12:47:15] Operations, DBA, Gerrit, Phabricator: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840#4270059 (Paladox) But phabricator changed it's behaviour and now clones refs/**. So to fix this we need regex to not clone refs/changes/**/meta.
[13:09:07] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /srv 59430 MB (12% inode=99%)
[13:18:57] RECOVERY - Disk space on elastic1019 is OK: DISK OK
[13:24:43] (PS1) Urbanecm: Revert "Change bewikiquote logo" [mediawiki-config] - https://gerrit.wikimedia.org/r/439477 (https://phabricator.wikimedia.org/T196134)
[13:26:02] (PS2) Urbanecm: Revert "Change bewikiquote logo" [mediawiki-config] - https://gerrit.wikimedia.org/r/439477 (https://phabricator.wikimedia.org/T196134)
[13:29:31] (PS1) Urbanecm: Change logo files for bewikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/439478 (https://phabricator.wikimedia.org/T196134)
[13:29:33] (PS1) Urbanecm: Use uploaded HD logo for bewikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/439479
[13:29:52] (CR) jerkins-bot: [V: -1] Use uploaded HD logo for bewikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/439479 (owner: Urbanecm)
[13:30:04] (PS2) Urbanecm: Use uploaded HD logo for bewikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/439479 (https://phabricator.wikimedia.org/T196134)
[13:30:18] (CR) jerkins-bot: [V: -1] Use uploaded HD logo for bewikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/439479 (https://phabricator.wikimedia.org/T196134) (owner: Urbanecm)
[13:50:07] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1965 bytes in 0.072 second response time
[13:53:37] (PS1) Aklapper: Make Phabricator footer links use Special:MyLanguage [puppet] - https://gerrit.wikimedia.org/r/439482 (https://phabricator.wikimedia.org/T196836)
[14:03:20] (PS1) Paladox: Gerrit: Add CoC and privacy policy to footer [puppet] - https://gerrit.wikimedia.org/r/439483
[14:06:20] (PS2) Paladox: Gerrit: Add CoC and privacy policy to footer [puppet] - https://gerrit.wikimedia.org/r/439483 (https://phabricator.wikimedia.org/T196835)
[14:07:49] (PS3) Paladox: Gerrit: Add CoC and privacy policy to footer [puppet] - https://gerrit.wikimedia.org/r/439483 (https://phabricator.wikimedia.org/T196835)
[14:10:27] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1965 bytes in 0.070 second response time
[14:14:35] (CR) Paladox: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/439483 (https://phabricator.wikimedia.org/T196835) (owner: Paladox)
[14:17:38] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1972 bytes in 0.074 second response time
[14:37:58] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1953 bytes in 0.066 second response time
[14:45:18] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1976 bytes in 0.095 second response time
[15:00:38] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1942 bytes in 0.087 second response time
[15:57:34] Are the weird centralauth using the wrong db errors at https://logstash.wikimedia.org/goto/90399bf0d838acda35862bc488c28a05 a known issue?
[17:00:18] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container,stop_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:01:28] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[17:11:43] (CR) Framawiki: [C: 1] Make Phabricator footer links use Special:MyLanguage [puppet] - https://gerrit.wikimedia.org/r/439482 (https://phabricator.wikimedia.org/T196836) (owner: Aklapper)
[18:48:35] (CR) MarcoAurelio: [C: 1] Make Phabricator footer links use Special:MyLanguage [puppet] - https://gerrit.wikimedia.org/r/439482 (https://phabricator.wikimedia.org/T196836) (owner: Aklapper)
[19:32:37] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1968 bytes in 0.065 second response time
[19:37:38] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1953 bytes in 0.076 second response time
[19:43:54] (PS1) Paladox: Rename wikimedia-polygerrit-style.html to gerrit-theme.html and also add footer links [software/gerrit] (deploy/wmf/stable-2.15) - https://gerrit.wikimedia.org/r/439503
[19:55:08] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1966 bytes in 0.063 second response time
[19:59:17] (PS2) Paladox: Rename wikimedia-polygerrit-style.html to gerrit-theme.html and also add footer links [software/gerrit] (deploy/wmf/stable-2.15) - https://gerrit.wikimedia.org/r/439503
[19:59:41] (PS3) Paladox: Rename wikimedia-polygerrit-style.html to gerrit-theme.html and also add footer links [software/gerrit] (deploy/wmf/stable-2.15) - https://gerrit.wikimedia.org/r/439503 (https://phabricator.wikimedia.org/T196835)
[20:00:08] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1945 bytes in 0.070 second response time
[20:00:36] (PS1) Paladox: Rename wikimedia-polygerrit-style.html to gerrit-theme.html [puppet] - https://gerrit.wikimedia.org/r/439504 (https://phabricator.wikimedia.org/T196835)
[20:02:32] (PS2) Paladox: Rename wikimedia-polygerrit-style.html to gerrit-theme.html [puppet] - https://gerrit.wikimedia.org/r/439504 (https://phabricator.wikimedia.org/T196835)
[20:03:05] (CR) Paladox: "this should be merged at the same time as the puppet change https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/439504/" [software/gerrit] (deploy/wmf/stable-2.15) - https://gerrit.wikimedia.org/r/439503 (https://phabricator.wikimedia.org/T196835) (owner: Paladox)
[20:03:26] (CR) Paladox: "This change is ready for review." [software/gerrit] (deploy/wmf/stable-2.15) - https://gerrit.wikimedia.org/r/439503 (https://phabricator.wikimedia.org/T196835) (owner: Paladox)
[20:03:51] (CR) Paladox: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/439504 (https://phabricator.wikimedia.org/T196835) (owner: Paladox)
[20:07:27] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1965 bytes in 0.084 second response time
[20:17:37] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1971 bytes in 0.061 second response time
[21:58:26] Operations, DBA, Gerrit, Phabricator: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840#4270673 (Paladox)