[00:01:13] musikanimal: if i find some time i would probably switch xtools usage to https://meta.wikimedia.org/wiki/Special:ApiSandbox#action=sitematrix&format=json . this should have all wikis and can be fetched in a single request instead of caching one for each wiki [00:01:27] actually, hmm i should check if thats actually true ... [00:01:34] it has labswiki at least :) [00:01:46] it still needs to access the labswiki database, which isn't replicated [00:01:59] musikanimal: shouldn't, don't you only need this in global-search for the domain name? [00:02:33] oh, I thought we were talking about XTools [00:02:48] (03CR) 10Tim Starling: "How would you extract from the forensic log a list of long-running requests? Wouldn't we be better off aggregating the request list from m" [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh) [00:02:51] musikanimal: i got 500 ISE out of global-search, and it turns out to be calls to xtools to find domain name for labswiki results [00:02:57] musikanimal: (sorry, i didn't actually say that earlier...) [00:03:04] ohhhh I see the connection [00:03:11] okay yes that's an issue [00:03:46] musikanimal: also as long as i'm mentioning things ... the results cache key doesn't include namespaces :) I might take some 10% time tomorrow to write a patch, but also other things...will see [00:04:18] oh goody. Thanks for the QA! The cache key thing is an easy fix [00:04:34] !log ebernhardson@deploy1001 Finished scap: php-1.34.0-wmf.6/extensions/CirrusSearch/includes/ T223738 Consider searching out of limits an error (duration: 21m 32s) [00:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:04:40] T223738: PHP Fatal Error on Special:Search with certain offset query parameter - https://phabricator.wikimedia.org/T223738 [00:04:56] musikanimal: also i just realized i switched rooms mid convo... 
oops [00:04:57] I might do without querying XTools for the domain and just use the replica database [00:05:03] haha [00:05:19] musikanimal: still wouldn't have labswiki [00:05:28] oh yeah, shit [00:05:49] musikanimal: anyways, all fine! i'll look tomorrow at a patch [00:06:25] well I can't change XTools to use sitematrix, because then it will think labswiki is valid and none of the queries will work [00:07:08] interesting that elastic search picks up labswiki even though it's not in the farm [00:07:44] musikanimal: elasticsearch uses everything not in the private dblist [00:07:57] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [00:08:14] so there's probably other wikis that will cause it to break too, not just labswiki [00:08:33] grrr [00:08:48] musikanimal: yea, i think pulling sitematrix from meta is probably easiest. No worries i can write that easily [00:09:19] okay, and just loop through to find a match for the db name. 
That'll work [00:09:31] that's a better solution anyway, removes the dependency on XTools [00:12:05] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [00:15:20] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 (10Eevans) [00:20:52] !log decommissioning restbase1010-a -- T223976 [00:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:20:56] T223976: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 [00:27:10] !log remove term protect-old-lvs-servers from cr1/2-eqiad - T224223 [00:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:27:15] T224223: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 [00:27:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10Traffic, and 2 others: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (10ayounsi) `lang=diff,name=cr1/2-eqiad [edit firewall family inet filter border-in4] - /* workaround until lvs1001-lvs1007 are decom'ed */ - term p... 
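Note on the sitematrix lookup discussed above ("pulling sitematrix from meta" and looping "to find a match for the db name"): a minimal sketch of that loop, assuming the parsed JSON response has a top-level `sitematrix` object with a `count` entry, numbered language groups that each carry a `site` list, and a `specials` list. The function name `find_wiki_url` and the `SAMPLE` data are hypothetical and hand-written for illustration. This is not the actual global-search patch.

```python
from typing import Any, Dict, Optional


def find_wiki_url(response: Dict[str, Any], dbname: str) -> Optional[str]:
    """Return the URL of the wiki whose dbname matches, or None if absent."""
    matrix = response.get("sitematrix", {})
    for key, group in matrix.items():
        if key == "count":
            # "count" is assumed to be a plain number, not a group of sites.
            continue
        if isinstance(group, dict):
            # Language group: sites are assumed to sit under "site".
            sites = group.get("site", [])
        elif isinstance(group, list):
            # "specials" is assumed to be a flat list of sites.
            sites = group
        else:
            continue
        for site in sites:
            if site.get("dbname") == dbname:
                return site.get("url")
    return None


# Hypothetical sample shaped like the assumed response; not real API output.
SAMPLE = {
    "sitematrix": {
        "count": 2,
        "0": {
            "code": "en",
            "site": [{"url": "https://en.wikipedia.org", "dbname": "enwiki"}],
        },
        "specials": [
            {"url": "https://wikitech.wikimedia.org", "dbname": "labswiki"},
        ],
    }
}

if __name__ == "__main__":
    print(find_wiki_url(SAMPLE, "labswiki"))
```

Taking the already-parsed response as input keeps the lookup independent of how the matrix is fetched or cached, so one request can serve every db name.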
[00:31:59] !log remove lvs1001-5 bgp sessions from cr1/2-eqiad - T224223 [00:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:43:17] 10Operations, 10ops-eqiad, 10DC-Ops, 10Traffic, and 2 others: decommission lvs100[123456].wikimedia.org - https://phabricator.wikimedia.org/T224223 (10ayounsi) a:05ayounsi→03RobH [01:00:41] 10Operations, 10netops, 10Patch-For-Review: RPKI Validation - https://phabricator.wikimedia.org/T220669 (10ayounsi) [01:21:23] (03PS1) 10Herron: netboot: assign kafka-main[12]00[1-5] 8 disk raid10 partman config [puppet] - 10https://gerrit.wikimedia.org/r/512307 (https://phabricator.wikimedia.org/T223493) [01:23:45] (03CR) 10Herron: [C: 03+2] netboot: assign kafka-main[12]00[1-5] 8 disk raid10 partman config [puppet] - 10https://gerrit.wikimedia.org/r/512307 (https://phabricator.wikimedia.org/T223493) (owner: 10Herron) [01:30:53] (03PS7) 10Ayounsi: Puppet, add RPKI validation daemon [puppet] - 10https://gerrit.wikimedia.org/r/508928 (https://phabricator.wikimedia.org/T220669) [01:48:09] (03PS8) 10Ayounsi: Puppet, add RPKI validation daemon [puppet] - 10https://gerrit.wikimedia.org/r/508928 (https://phabricator.wikimedia.org/T220669) [03:03:31] (03PS9) 10Ayounsi: Puppet, add RPKI validation daemon [puppet] - 10https://gerrit.wikimedia.org/r/508928 (https://phabricator.wikimedia.org/T220669) [03:38:27] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - /usr/lib/nagios/plugins/check_ripe_atlas.py failed with HTTPError: HTTP Error 504: Gateway Time-out https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [03:42:57] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 3 probes of 461 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [03:48:16] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install kafka-main200[1-5] - https://phabricator.wikimedia.org/T223493 
(10herron) Kafka-main2001 is installed. I updated the netboot config to assign the partman config to these hostnames, and switched the hardware controller to HBA mode. Then it co... [04:47:51] (03PS1) 10Marostegui: dbproxy1010: Repool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/512309 [04:50:29] (03CR) 10Marostegui: [C: 03+2] dbproxy1010: Repool labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/512309 (owner: 10Marostegui) [05:27:09] (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Remove db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512311 (https://phabricator.wikimedia.org/T220170) [05:30:12] !log Reload haproxy on dbproxy1010 to repool labsdb1011 [05:30:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:30:36] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Remove db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512311 (https://phabricator.wikimedia.org/T220170) (owner: 10Marostegui) [05:31:35] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512311 (https://phabricator.wikimedia.org/T220170) (owner: 10Marostegui) [05:34:25] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove db2062 from config T220170 (duration: 00m 49s) [05:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:34:30] T220170: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170 [05:35:38] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Remove db2062 from config T220170 (duration: 00m 48s) [05:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:24] (03PS1) 10Marostegui: mariadb: Move db2062 from s1 to m1 [puppet] - 10https://gerrit.wikimedia.org/r/512312 (https://phabricator.wikimedia.org/T220170) [05:43:14] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Move db2062 from s1 to 
m1 [puppet] - 10https://gerrit.wikimedia.org/r/512312 (https://phabricator.wikimedia.org/T220170) (owner: 10Marostegui) [05:46:42] (03CR) 10Marostegui: [V: 03+2 C: 03+2] "PCC looks happy: https://puppet-compiler.wmflabs.org/compiler1001/16751/ the violation is a known issue that will be tackled on a refactor" [puppet] - 10https://gerrit.wikimedia.org/r/512312 (https://phabricator.wikimedia.org/T220170) (owner: 10Marostegui) [05:58:40] !log mobrovac@deploy1001 Started deploy [restbase/deploy@b153f5d]: Remove Parsoid fallback and rate-limit stashing - T215956 T224055 [05:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:58:46] T224055: Rate-limit requests in parsoid.js that use stashing - https://phabricator.wikimedia.org/T224055 [05:58:46] T215956: Consider stashing data-parsoid for VE - https://phabricator.wikimedia.org/T215956 [06:05:04] (03PS1) 10Marostegui: db-eqiad.php: More traffic to new hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512313 [06:06:06] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic to new hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512313 (owner: 10Marostegui) [06:07:04] (03Merged) 10jenkins-bot: db-eqiad.php: More traffic to new hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512313 (owner: 10Marostegui) [06:08:18] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to new hosts T220170 (duration: 00m 48s) [06:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:22] T220170: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - https://phabricator.wikimedia.org/T220170 [06:17:14] !log Stop MySQL on db2078:m1 to clone db2062 - T220170 [06:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:18] T220170: Address Database infrastructure blockers on datacenter switchover & multi-dc deployment - 
https://phabricator.wikimedia.org/T220170 [06:20:11] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@b153f5d]: Remove Parsoid fallback and rate-limit stashing - T215956 T224055 (duration: 21m 30s) [06:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:24] T224055: Rate-limit requests in parsoid.js that use stashing - https://phabricator.wikimedia.org/T224055 [06:20:25] T215956: Consider stashing data-parsoid for VE - https://phabricator.wikimedia.org/T215956 [06:20:54] !log mobrovac@deploy1001 Started deploy [restbase/deploy@b153f5d] (dev-cluster): Remove Parsoid fallback and rate-limit stashing [06:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:22:22] (03PS1) 10Marostegui: install_server: Do not reimage db2062 [puppet] - 10https://gerrit.wikimedia.org/r/512315 [06:26:35] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@b153f5d] (dev-cluster): Remove Parsoid fallback and rate-limit stashing (duration: 05m 41s) [06:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:28] PROBLEM - puppet last run on analytics1071 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/maven/ivysettings.xml] [06:32:30] PROBLEM - puppet last run on aqs1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml] [06:33:02] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/profile.d/bash_autologout.sh] [06:34:42] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:34:58] (03CR) 10Hashar: "recheck" [debs/helm-diff] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/509015 (owner: 10Fsero) [06:35:00] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:35:43] (03CR) 10jenkins-bot: Invariant config cleanup: IV - DJVU rendering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501006 (owner: 10Jforrester) [06:37:05] 10Operations, 10ops-eqiad, 10DC-Ops, 10Services: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10Joe) [06:37:34] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:37:50] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:38:55] !log restbase-dev1006 puppet disabled - T224260 [06:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:00] T224260: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 [06:39:48] !log restbase-dev1006 stop restbase - T224260 [06:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:40:01] !log restbase-dev1006 decommission cass-a - T224260 [06:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[06:41:04] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:41:06] PROBLEM - BFD status on cr2-eqord is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:41:18] PROBLEM - OSPF status on cr2-eqord is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:41:55] <_joe_> hey can someone look at this ^^ [06:42:05] <_joe_> I'm working on another issue right now [06:42:28] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [06:42:46] <_joe_> the spike is already over [06:43:04] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:43:29] <_joe_> !log disable notifications in icinga for restbase-dev1006 T224260 [06:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:54] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [06:44:54] 10Operations, 10serviceops: upgrade and rename krypton & create its codfw equivalent - https://phabricator.wikimedia.org/T224247 (10MoritzMuehlenhoff) > switch to PHP 7.2 (T224194) We did it for Phabricator for some feature readded in 7.1 and for the main wikis for performance reasons, but for random misc ser... 
[06:45:37] !log restbase-dev1006 decommission cass-b - T224260 [06:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:45:41] T224260: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 [06:46:33] RECOVERY - Check whether ferm is active by checking the default input chain on mw2286 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:47:09] !log bounced ferm on mw2286, wasn't correctly started after reboot [06:47:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:51] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:49:35] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [06:49:59] RECOVERY - Check systemd state on mw2286 is OK: OK - running: The system is fully operational [06:54:22] 10Operations, 10ops-eqiad: Broken disk on analytics1039 - https://phabricator.wikimedia.org/T224261 (10MoritzMuehlenhoff) [06:58:35] RECOVERY - puppet last run on aqs1005 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:59:57] RECOVERY - puppet last run on analytics1071 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:01:01] (03PS5) 10Acamicamacaraca: Enable VisualEditor in draft namespace on sr.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512220 (https://phabricator.wikimedia.org/T223024) [07:01:47] 10Operations, 10ops-eqiad, 10DC-Ops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10mobrovac) [07:04:35] RECOVERY - Esams HTTP 5xx reqs/min on 
graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [07:05:04] !log restbase-dev1006 force-stop the cassandra instances, fsync exception during decomm - T224260 [07:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:08] T224260: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 [07:09:17] 10Operations, 10ops-eqiad, 10Cassandra, 10DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10mobrovac) [07:10:54] 10Operations, 10ops-eqiad, 10Cassandra, 10DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10mobrovac) The node is now ready to be taken over #dc-ops for disk replacement. [07:12:03] 10Operations, 10ops-eqiad, 10Cassandra, 10DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10MoritzMuehlenhoff) a:03Cmjohnson [07:28:51] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db2062 [puppet] - 10https://gerrit.wikimedia.org/r/512315 (owner: 10Marostegui) [07:32:53] !log rebooting labweb* for kernel security update [07:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:12] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [07:33:14] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:56] (03PS1) 10Marostegui: db1072: Remove a left over [puppet] - 10https://gerrit.wikimedia.org/r/512322 [07:40:28] (03PS2) 10Marostegui: db1072: Remove a leftover [puppet] - 10https://gerrit.wikimedia.org/r/512322 [07:42:53] (03CR) 10Marostegui: "As 
expected, this is a NOOP: https://puppet-compiler.wmflabs.org/compiler1002/16752/" [puppet] - 10https://gerrit.wikimedia.org/r/512322 (owner: 10Marostegui) [07:44:06] (03CR) 10Marostegui: [C: 03+2] db1072: Remove a leftover [puppet] - 10https://gerrit.wikimedia.org/r/512322 (owner: 10Marostegui) [07:50:14] (03PS1) 10Hashar: Polish up Debian packaging [debs/helm-diff] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/512323 [07:51:52] (03CR) 10Hashar: "Seems to be build for me using DIST=buster :]" [debs/helm-diff] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/512323 (owner: 10Hashar) [07:52:51] (03CR) 10Hashar: "(forgot to say: my intent is to get the CI debian-glue job to pass)" [debs/helm-diff] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/512323 (owner: 10Hashar) [08:03:12] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "A couple comments that I think are important, but the removal of the patched files LGTM." (033 comments) [debs/helm-diff] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/512323 (owner: 10Hashar) [08:05:41] news of the day, Debian policy switched to Sphinx https://www.debian.org/doc/debian-policy/ch-controlfields.html :] [08:14:18] (03CR) 10Hashar: "Also the debian patches lack proper description / metadata etc." (034 comments) [debs/helm-diff] (debian/stretch-wikimedia) - 10https://gerrit.wikimedia.org/r/512323 (owner: 10Hashar) [08:20:36] <_joe_> hashar: I don't think DEP-3 tags are a /need/ [08:20:46] <_joe_> but they help in patches, sure [08:20:54] what are DEP-3 tags ? :) [08:21:03] is that what we find in patch files? 
[08:21:06] <_joe_> the ones for the patches [08:21:09] <_joe_> yes [08:21:11] yeah [08:21:27] I guess the patches just got created magically via git buildpackage or dpkg-source [08:25:55] <_joe_> via git diff, usually [08:26:14] <_joe_> git diff > debian/patches/some-name.patch [08:26:31] <_joe_> or via quilt, if you have many [08:30:16] (03PS1) 10Marostegui: db1064: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/512325 (https://phabricator.wikimedia.org/T223217) [08:32:13] (03CR) 10Marostegui: [C: 03+2] db1064: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/512325 (https://phabricator.wikimedia.org/T223217) (owner: 10Marostegui) [08:34:36] 10Operations, 10ops-eqiad, 10decommission: Decommission rhenium - https://phabricator.wikimedia.org/T224268 (10MoritzMuehlenhoff) [08:36:05] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:36:05] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:37:26] (03PS1) 10Muehlenhoff: Decommission rhenium [puppet] - 10https://gerrit.wikimedia.org/r/512327 (https://phabricator.wikimedia.org/T224268) [08:37:36] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: Decommission rhenium - https://phabricator.wikimedia.org/T224268 (10MoritzMuehlenhoff) [08:38:46] (03PS2) 10Muehlenhoff: Decommission rhenium [puppet] - 10https://gerrit.wikimedia.org/r/512327 (https://phabricator.wikimedia.org/T224268) [08:39:37] (03CR) 10Muehlenhoff: [C: 03+2] Decommission rhenium [puppet] - 10https://gerrit.wikimedia.org/r/512327 (https://phabricator.wikimedia.org/T224268) (owner: 10Muehlenhoff) [08:45:15] 10Operations, 10ops-eqiad, 10decommission, 10Patch-For-Review: 
Decommission rhenium - https://phabricator.wikimedia.org/T224268 (10MoritzMuehlenhoff) p:05Triage→03Normal [08:51:57] (03PS56) 10Gehel: icinga: create and apply cirrus settings check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [08:53:48] (03CR) 10Gehel: [C: 03+2] icinga: create and apply cirrus settings check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [08:54:07] onimisionipe: ^ [08:58:30] Yes! [09:05:22] 10Operations, 10Continuous-Integration-Config: Fix operations/puppet.git "rebase hell" - https://phabricator.wikimedia.org/T224033 (10hashar) >>! In T224033#5207686, @Volans wrote: > If I have 2 CRs, chained one on top of another and I +2 both of them because I want to deploy them together, and the first one f... [09:06:48] 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366 (10Aklapper) p:05High→03Normal [Lowering priority to reflect](https://www.mediawiki.org/wiki/Phabricator/Project_management#Setting_task_priorities) th... [09:09:20] 10Operations, 10Cloud-VPS, 10Traffic, 10cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (10aborrero) >>! In T223902#5209309, @BBlack wrote: > Do these belong in `wikimedia.org` at all? It seems this has already been discussed, bu... [09:21:44] 10Operations, 10Continuous-Integration-Config: Fix operations/puppet.git "rebase hell" - https://phabricator.wikimedia.org/T224033 (10hashar) >>! In T224033#5208260 and T224033#5208268 @BBlack wrote: > A few thoughts: > > * None of our CI on ops/puppet provides very strong guarantees of correctness regardless... 
[09:37:36] (03PS1) 10Arturo Borrero Gonzalez: [RFC] etcd::ssl: restart etcd service when the SSL cert changes [puppet] - 10https://gerrit.wikimedia.org/r/512338 (https://phabricator.wikimedia.org/T169287) [09:38:06] 10Operations, 10Cloud-Services, 10Kubernetes, 10Patch-For-Review: etcd config depends on puppet certs, but puppet doesn't know - https://phabricator.wikimedia.org/T169287 (10aborrero) >>! In T169287#5208609, @Bstorm wrote: > But puppet doesn't run the agent like normal when it modifies a cert. It waits fo... [09:38:26] 10Operations, 10ops-eqiad: Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T220880 (10Volans) 05Resolved→03Open p:05Triage→03Normal Re-opening as the disk ended up in a failed state with 2 failed disks! The automatic task was not opened because it was already in alarm in Icinga, so i... [09:39:12] 10Operations, 10ops-eqiad: Degraded RAID on analytics1039 - https://phabricator.wikimedia.org/T220880 (10Volans) [09:39:14] 10Operations, 10ops-eqiad: Broken disk on analytics1039 - https://phabricator.wikimedia.org/T224261 (10Volans) [09:39:20] moritzm: I've reopened the old one and merged yours into it ^^^ [09:40:39] ah, missed that one in phab search [09:41:13] it was closed [09:41:21] I found it via Icinga comment [09:42:47] 10Operations, 10Continuous-Integration-Config: Fix operations/puppet.git "rebase hell" - https://phabricator.wikimedia.org/T224033 (10hashar) (I edited my previous comment since I submitted it before I was done with it) Just a note, I have filled this task since several people talked about it recently. My int... 
[09:44:52] (03CR) 10Arturo Borrero Gonzalez: "A PCC run on prod k8s: https://puppet-compiler.wmflabs.org/compiler1001/16755/" [puppet] - 10https://gerrit.wikimedia.org/r/512338 (https://phabricator.wikimedia.org/T169287) (owner: 10Arturo Borrero Gonzalez) [09:46:41] (03PS1) 10Mathew.onipe: icinga: correct cirrus settings file name [puppet] - 10https://gerrit.wikimedia.org/r/512340 (https://phabricator.wikimedia.org/T218932) [09:48:01] (03PS1) 10Aklapper: toolserver: redirect svgtranslate to toolforge [puppet] - 10https://gerrit.wikimedia.org/r/512341 (https://phabricator.wikimedia.org/T224265) [10:01:47] (03PS1) 10Volans: admin: add jfishback to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/512342 (https://phabricator.wikimedia.org/T222910) [10:03:05] PROBLEM - puppet last run on db1097 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [10:04:23] (03CR) 10Gehel: [C: 03+2] icinga: correct cirrus settings file name [puppet] - 10https://gerrit.wikimedia.org/r/512340 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [10:09:46] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/511766 (owner: 10Jbond) [10:09:55] !log decommission restbase1010-b - T223976 [10:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:07] T223976: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 [10:14:28] (03CR) 10Jbond: [C: 03+2] pybal: remove redundant hiera config [puppet] - 10https://gerrit.wikimedia.org/r/511766 (owner: 10Jbond) [10:14:39] (03PS6) 10Jbond: pybal: remove redundant hiera config [puppet] - 10https://gerrit.wikimedia.org/r/511766 [10:17:30] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: access for foks to labweb (in one way or another) (or make changePassword.php work on mwmaint hosts) - https://phabricator.wikimedia.org/T220860 (10Volans) @jrbs Any update on this? 
[10:25:48] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [10:25:49] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:25:49] !log rebooting prometheous2004 [10:25:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:51] RECOVERY - puppet last run on db1097 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [10:36:21] PROBLEM - Prometheus prometheus2004/analytics restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1:9905 job=prometheus https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/analytics [10:36:31] PROBLEM - Prometheus prometheus2004/services restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1:9903 job=prometheus https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/services [10:36:37] PROBLEM - Prometheus prometheus2004/ops restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops [10:37:09] PROBLEM - Prometheus prometheus2004/k8s restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1:9906 job=prometheus https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s [10:37:29] PROBLEM - Prometheus prometheus2004/global restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance={127.0.0.1:9900,127.0.0.1:9904} job=prometheus site=codfw https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [10:41:30] volans: i think theses are 
the alerts you mentioned, how do i clear them? [10:42:30] those go away in 10 minutes becoming warning and back to ok in 30m [10:42:39] ahh ok thanks [10:42:51] what you should look for are other unrelated alarms that might be triggered [10:42:57] do you also know a way i can validate if the node has been correctly repooled? [10:43:03] ack [10:43:19] for the UI or the data gathering? [10:43:46] data gathering (i guess that's what conftool controls) [10:43:49] prometheus is polling the endpoints, nothing pushes data to it, so I guess its logs or that there is data incoming [10:45:41] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Kanban), 10User-Urbanecm, and 2 others: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830 (10Volans) @Urbanecm is it ok to use the email you used to sign the NDA for the related patch in... [10:47:49] what service is load balanced then? [10:50:24] I guess prometheus web interface? https://wikitech.wikimedia.org/wiki/Prometheus#Access_Prometheus_web_interface [10:50:40] <_joe_> the metrics retrieving api I guess [10:51:05] yeah the one we use on grafana [10:51:17] but nothing data-gathering related AFAIK [10:51:33] ack thanks just reading through the docs now and that would seem true [10:51:48] ok thanks i'll continue with the next one then :) [10:56:13] !log rebooting prometheus2003 [10:56:14] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [10:56:15] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:16] 10Operations, 10SRE-Access-Requests, 10Release-Engineering-Team (Kanban), 10User-Urbanecm, and 2 others: Requesting access to production for SWAT deploy for Urbanecm - 
https://phabricator.wikimedia.org/T192830 (10Urbanecm) >>! In T192830#5210075, @Volans wrote: > @Urbanecm is it ok to use the email you use... [11:03:21] PROBLEM - Prometheus prometheus1003/global restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=codfw https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [11:03:43] RECOVERY - Prometheus prometheus2004/k8s restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s [11:04:03] RECOVERY - Prometheus prometheus2004/global restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [11:04:17] PROBLEM - Prometheus prometheus1004/global restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=codfw https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [11:04:21] RECOVERY - Prometheus prometheus2004/analytics restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/analytics [11:04:31] RECOVERY - Prometheus prometheus2004/services restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/services [11:04:37] RECOVERY - Prometheus prometheus2004/ops restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops [11:04:45] RECOVERY - Prometheus prometheus1003/global restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [11:05:41] RECOVERY - Prometheus prometheus1004/global restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [11:06:39] PROBLEM - Prometheus prometheus2003/analytics restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1:9905 job=prometheus https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/analytics [11:07:13] PROBLEM - Prometheus prometheus2003/k8s restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1:9906 job=prometheus https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s [11:07:37] PROBLEM - Prometheus prometheus2003/ops restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1:9900 job=prometheus https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops [11:07:37] PROBLEM - Prometheus prometheus2003/global restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance={127.0.0.1:9900,127.0.0.1:9904} job=prometheus site=codfw https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [11:07:57] PROBLEM - Prometheus prometheus2003/services restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1:9903 job=prometheus https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/services [11:08:43] 
(03PS1) 10Volans: admin: enable shell access for urbanecm [puppet] - 10https://gerrit.wikimedia.org/r/512349 (https://phabricator.wikimedia.org/T192830) [11:09:42] (03CR) 10Volans: "Key verified via:" [puppet] - 10https://gerrit.wikimedia.org/r/512349 (https://phabricator.wikimedia.org/T192830) (owner: 10Volans) [11:23:38] !log rebooting prometheous1004 [11:23:39] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [11:23:39] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [11:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:45] (03PS2) 10Volans: icinga: fix meta-monitoring sync script [puppet] - 10https://gerrit.wikimedia.org/r/511720 (https://phabricator.wikimedia.org/T222074) [11:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:47] (03PS2) 10Volans: icinga: fix location of meta-monitoring script [puppet] - 10https://gerrit.wikimedia.org/r/511721 (https://phabricator.wikimedia.org/T222074) [11:23:47] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [11:23:48] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:23:49] (03PS2) 10Volans: icinga: add metamonitor user and its keyholder [puppet] - 10https://gerrit.wikimedia.org/r/511722 (https://phabricator.wikimedia.org/T222074) [11:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:11] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [11:25:11] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:23] PROBLEM - PyBal backends health check on lvs1015 is 
CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus_80: Servers prometheus1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:28:37] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus_80: Servers prometheus1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:29:32] ^^ this was me, 1003 was rebooted in place of 1004 [11:29:51] server was depooled for most of the downtime [11:29:51] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:30:03] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:30:13] <_joe_> this means the depool_threshold is wrong [11:30:25] RECOVERY - Prometheus prometheus2003/services restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/services [11:30:33] RECOVERY - Prometheus prometheus2003/analytics restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/analytics [11:30:39] <_joe_> unless I didn't understand what happened [11:31:07] RECOVERY - Prometheus prometheus2003/k8s restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s [11:31:18] _joe_: no i think it's a genuine alert. I depooled 1004 and rebooted 1003. i noticed quickly and changed the pool state, obviously not quick enough [11:31:31] RECOVERY - Prometheus prometheus2003/ops restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops [11:31:31] PROBLEM - Prometheus prometheus2003/global restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=eqiad https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [11:31:33] <_joe_> heh I see :) [11:32:48] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [11:32:48] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:32:49] !log [actully] rebooting prometheous1004 now [11:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:03] (03PS1) 10Mathew.onipe: icinga: cirrus settings check is Ok when file config is empty [puppet] - 10https://gerrit.wikimedia.org/r/512352 (https://phabricator.wikimedia.org/T218932) [11:35:17] PROBLEM - Prometheus prometheus1003/analytics restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9905 job=prometheus https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/analytics [11:35:27] PROBLEM - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9900 job=prometheus https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [11:35:31] PROBLEM - Prometheus prometheus1003/k8s restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9906 job=prometheus https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s [11:35:41] PROBLEM - Prometheus prometheus1003/global restarted: beware possible monitoring artifacts on prometheus1003 is 
CRITICAL: instance={127.0.0.1:9900,127.0.0.1:9904} job=prometheus site=eqiad https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [11:35:55] PROBLEM - Prometheus prometheus1003/k8s-staging restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9907 job=prometheus https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s-staging [11:36:13] PROBLEM - Prometheus prometheus1003/services restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9903 job=prometheus https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/services [11:40:01] PROBLEM - Prometheus prometheus2003/global restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=eqiad https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [11:43:27] PROBLEM - Prometheus prometheus2004/global restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=eqiad https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [11:43:35] PROBLEM - Prometheus prometheus1004/analytics restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1:9905 job=prometheus https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/analytics [11:43:37] PROBLEM - Prometheus prometheus1004/k8s-staging restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1:9907 job=prometheus https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s-staging [11:43:43] PROBLEM - Prometheus prometheus1004/global restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: 
instance={127.0.0.1:9900,127.0.0.1:9904} job=prometheus site=eqiad https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [11:44:11] PROBLEM - Prometheus prometheus1004/k8s restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1:9906 job=prometheus https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s [11:44:45] PROBLEM - Prometheus prometheus1004/services restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1:9903 job=prometheus https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/services [11:44:50] godog: ^ [11:45:44] !log Updated the Wikidata property suggester with data from the 2019-05-13 JSON dump and applied the T132839 workarounds [11:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:48] T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [11:45:51] PROBLEM - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [11:50:51] (03CR) 10Jbond: "Nothing wrong with the change however the requester originally asked for access to deploy1001, naos and terbium. 
I'm not sure of the last" [puppet] - 10https://gerrit.wikimedia.org/r/512349 (https://phabricator.wikimedia.org/T192830) (owner: 10Volans) [11:51:46] 10Operations, 10DNS, 10Matrix, 10Traffic, and 2 others: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T223835 (10jbond) ack, thanks for the clarification [11:52:47] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:54:15] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 35, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:54:16] (03PS2) 10Mathew.onipe: icinga: cirrus settings check is Ok when file config is empty [puppet] - 10https://gerrit.wikimedia.org/r/512352 (https://phabricator.wikimedia.org/T218932) [11:57:51] RECOVERY - Prometheus prometheus1003/analytics restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/analytics [11:58:01] RECOVERY - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [11:58:07] RECOVERY - Prometheus prometheus1003/k8s restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s [11:58:15] RECOVERY - Prometheus prometheus1003/global restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [11:58:31] RECOVERY - Prometheus prometheus1003/k8s-staging restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s-staging [11:58:49] RECOVERY - Prometheus prometheus1003/services restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/services [11:58:59] RECOVERY - Prometheus prometheus2004/global restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [12:01:11] RECOVERY - Prometheus prometheus2003/global restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [12:04:39] (03CR) 10Volans: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/512349 (https://phabricator.wikimedia.org/T192830) (owner: 10Volans) [12:06:13] RECOVERY - Prometheus prometheus1004/global restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [12:06:43] RECOVERY - Prometheus prometheus1004/k8s restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s [12:06:59] RECOVERY - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [12:07:15] RECOVERY - Prometheus prometheus1004/services restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/services [12:07:31] RECOVERY - Prometheus prometheus1004/analytics restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/analytics [12:07:35] RECOVERY - Prometheus prometheus1004/k8s-staging restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s-staging [12:13:57] 10Operations, 10ops-eqiad: Storage problems with new host db1133 - https://phabricator.wikimedia.org/T222731 (10Marostegui) This host can be taken down for debugging anytime without heads up to the DBAs - it doesn't even have an OS [12:31:04] 10Operations, 10Cloud-VPS, 10Traffic, 10cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (10BBlack) Ok, @aborrero caught me up on all the context on IRC so I can stop asking dumb questions (Thanks!). I think `wikimedia.org` as the... 
[12:31:55] (03CR) 10Gehel: [C: 04-1] "minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/512352 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [12:38:47] (03CR) 10Jbond: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/512349 (https://phabricator.wikimedia.org/T192830) (owner: 10Volans) [12:43:46] (03CR) 10jenkins-bot: Invariant config cleanup: V - Notifications matters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501007 (owner: 10Jforrester) [12:44:05] (03CR) 10jenkins-bot: Invariant config cleanup: VI - Watchlist default setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501008 (owner: 10Jforrester) [12:46:36] (03PS3) 10Mathew.onipe: icinga: cirrus settings check is Ok when file config is empty [puppet] - 10https://gerrit.wikimedia.org/r/512352 (https://phabricator.wikimedia.org/T218932) [12:46:45] (03CR) 10Mathew.onipe: icinga: cirrus settings check is Ok when file config is empty (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/512352 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [13:07:42] (03CR) 10Muehlenhoff: [C: 03+1] admin: add jfishback to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/512342 (https://phabricator.wikimedia.org/T222910) (owner: 10Volans) [13:18:35] (03PS3) 10Volans: icinga: fix meta-monitoring sync script [puppet] - 10https://gerrit.wikimedia.org/r/511720 (https://phabricator.wikimedia.org/T222074) [13:19:19] (03CR) 10Volans: [C: 03+2] icinga: fix meta-monitoring sync script [puppet] - 10https://gerrit.wikimedia.org/r/511720 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans) [13:20:01] (03PS4) 10Gehel: icinga: cirrus settings check is Ok when file config is empty [puppet] - 10https://gerrit.wikimedia.org/r/512352 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [13:20:18] (03PS3) 10Volans: icinga: fix location of meta-monitoring script [puppet] - 
10https://gerrit.wikimedia.org/r/511721 (https://phabricator.wikimedia.org/T222074) [13:21:15] (03CR) 10Volans: [C: 03+2] icinga: fix location of meta-monitoring script [puppet] - 10https://gerrit.wikimedia.org/r/511721 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans) [13:21:25] (03CR) 10Gehel: [C: 03+2] icinga: cirrus settings check is Ok when file config is empty [puppet] - 10https://gerrit.wikimedia.org/r/512352 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [13:21:37] (03PS5) 10Gehel: icinga: cirrus settings check is Ok when file config is empty [puppet] - 10https://gerrit.wikimedia.org/r/512352 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [13:21:53] sorry gehel :) [13:22:00] volans: you stole my slot! [13:22:10] lol [13:22:23] do you need a puppet run on icinga? I'll do one anyway I can wait yours too [13:25:42] volans: nope, only on the elastic nodes [13:25:55] ack [13:32:18] (03PS3) 10Volans: icinga: add metamonitor user and its keyholder [puppet] - 10https://gerrit.wikimedia.org/r/511722 (https://phabricator.wikimedia.org/T222074) [13:41:09] (03CR) 10Urbanecm: [C: 03+1] "From the data side, everything looks correctly." [puppet] - 10https://gerrit.wikimedia.org/r/512349 (https://phabricator.wikimedia.org/T192830) (owner: 10Volans) [13:42:28] (03PS1) 10Marostegui: eventlogging.my.cnf: Increase buffer pool from 50G to 300G [puppet] - 10https://gerrit.wikimedia.org/r/512365 (https://phabricator.wikimedia.org/T224291) [13:42:42] (03PS2) 10Volans: admin: enable shell access for urbanecm [puppet] - 10https://gerrit.wikimedia.org/r/512349 (https://phabricator.wikimedia.org/T192830) [13:43:44] (03PS1) 10Giuseppe Lavagetto: Use non-native debian version. 
[software/service-checker] - 10https://gerrit.wikimedia.org/r/512366 [13:43:55] (03CR) 10Volans: [C: 03+2] admin: enable shell access for urbanecm [puppet] - 10https://gerrit.wikimedia.org/r/512349 (https://phabricator.wikimedia.org/T192830) (owner: 10Volans) [13:44:16] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Use non-native debian version. [software/service-checker] - 10https://gerrit.wikimedia.org/r/512366 (owner: 10Giuseppe Lavagetto) [13:46:41] (03CR) 10jenkins-bot: Invariant config cleanup: X - Extensions loaded on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501012 (owner: 10Jforrester) [13:47:47] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Remove db2062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512311 (https://phabricator.wikimedia.org/T220170) (owner: 10Marostegui) [13:47:56] (03CR) 10jenkins-bot: db-eqiad.php: More traffic to new hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512313 (owner: 10Marostegui) [13:48:02] (03CR) 10jenkins-bot: Invariant config cleanup: VII - RL local storage setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501009 (owner: 10Jforrester) [13:48:10] (03CR) 10jenkins-bot: Invariant config cleanup: IX - RightsIcon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501011 (owner: 10Jforrester) [13:48:21] (03CR) 10jenkins-bot: Invariant config cleanup: VIII - ULS logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501010 (owner: 10Jforrester) [14:09:17] (03PS1) 10Giuseppe Lavagetto: Merge branch 'master' into stretch [software/service-checker] (stretch) - 10https://gerrit.wikimedia.org/r/512369 [14:11:14] (03PS1) 10Giuseppe Lavagetto: Rebuild for stretch [software/service-checker] (stretch) - 10https://gerrit.wikimedia.org/r/512371 [14:11:34] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Merge branch 'master' into stretch [software/service-checker] (stretch) - 10https://gerrit.wikimedia.org/r/512369 (owner: 10Giuseppe Lavagetto) [14:11:56] (03CR) 10Giuseppe Lavagetto: 
[C: 03+2] Rebuild for stretch [software/service-checker] (stretch) - 10https://gerrit.wikimedia.org/r/512371 (owner: 10Giuseppe Lavagetto) [14:12:05] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Rebuild for stretch [software/service-checker] (stretch) - 10https://gerrit.wikimedia.org/r/512371 (owner: 10Giuseppe Lavagetto) [14:17:42] (03CR) 10Volans: [C: 03+1] "LGTM, [nit] there are few added newlines that seems unnecessary." [puppet] - 10https://gerrit.wikimedia.org/r/512299 (owner: 10CRusnov) [14:20:40] (03CR) 10Ori.livneh: "TIm, both approaches are useful and I think they would complement each other. I think it depends on whether you are responding to an ongoi" [puppet] - 10https://gerrit.wikimedia.org/r/511751 (owner: 10Ori.livneh) [14:23:37] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [14:23:46] 10Operations, 10Release-Engineering-Team, 10SRE-Access-Requests: Request access to deployment cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223698 (10darthmon_wmde) hi! I am @alaa_wmde's manager at WMDE and hereby I approve that Alaa gets the access needed [14:24:45] 10Operations, 10Release-Engineering-Team, 10SRE-Access-Requests: Request access to analytics cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223697 (10darthmon_wmde) hi! 
I am @alaa_wmde's manager at WMDE and hereby I approve that Alaa gets the access needed [14:25:11] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [14:26:43] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [14:30:39] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [14:30:47] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [14:30:57] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [14:33:56] (03CR) 10Hashar: [C: 03+1] "We eventually completely forgot about this change. 
I guess it needs a rebase now :]" [puppet] - 10https://gerrit.wikimedia.org/r/474824 (owner: 10Thcipriani) [14:34:24] (03CR) 10Hashar: [C: 03+1] "recheck" [software] - 10https://gerrit.wikimedia.org/r/484806 (owner: 10Thcipriani) [14:35:14] (03CR) 10jerkins-bot: [V: 04-1] Use python2 as basepython [software] - 10https://gerrit.wikimedia.org/r/484806 (owner: 10Thcipriani) [14:42:24] (03CR) 10Volans: [C: 04-1] "This needs to wait the creation of the target instance, as per task comments." [dns] - 10https://gerrit.wikimedia.org/r/511842 (https://phabricator.wikimedia.org/T223835) (owner: 10Volans) [14:43:11] * Krinkle staging on mwdebug1002 soon [14:55:04] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.6/includes/libs/objectcache/: d262078b1 / T220470 (duration: 01m 06s) [14:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:10] T220470: Investigate backend save timing regression starting at 2019-04-08 19:15:00 - https://phabricator.wikimedia.org/T220470 [15:01:20] !log upload python{,3}-statsd.3.2.1-2 to jessie-wikimedia [15:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:41] 10Operations, 10DBA, 10Jade, 10TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgements about wiki entities - https://phabricator.wikimedia.org/T200297 (10kchapman) TechCom is placing this on last call ending July 6th 05:00 UTC/07:00 CEST/July 5 22:00 PDT [15:11:48] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/511686 (owner: 10Jbond) [15:26:07] (03CR) 10CRusnov: [C: 03+2] "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/512299 (owner: 10CRusnov) [15:26:19] (03PS2) 10CRusnov: profile::netbox: Add librenms configuration for reports [puppet] - 10https://gerrit.wikimedia.org/r/512299 [15:27:48] (03CR) 10Herron: "What are your thoughts about running a compile/diff run that utilizes the prod private repository? 
Sadly PCC doesn't always reflect reali" [puppet] - 10https://gerrit.wikimedia.org/r/511686 (owner: 10Jbond) [15:30:30] !log disable bgp to telia on cr1-codfw for X-connect investigation - T222967 [15:30:31] PROBLEM - cassandra-b SSL 10.64.0.115:7001 on restbase1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [15:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:35] T222967: Interface errors on cr1-codfw: xe-5/3/1 - https://phabricator.wikimedia.org/T222967 [15:30:41] PROBLEM - cassandra-b CQL 10.64.0.115:9042 on restbase1010 is CRITICAL: connect to address 10.64.0.115 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [15:36:51] (03CR) 10CRusnov: Add LibreNMS parity check report. (032 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) (owner: 10CRusnov) [15:42:53] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.0.115:9042 on restbase1010 is CRITICAL: connect to address 10.64.0.115 and port 9042: Connection refused eevans Decommissioned (T223976) https://phabricator.wikimedia.org/T93886 [15:42:53] ACKNOWLEDGEMENT - cassandra-b SSL 10.64.0.115:7001 on restbase1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans Decommissioned (T223976) https://phabricator.wikimedia.org/T120662 [15:44:56] !log decommissioning restbase1010-c -- T223976 [15:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:01] T223976: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 [15:46:30] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), and 3 others: Requesting access to production for SWAT deploy for Urbanecm - https://phabricator.wikimedia.org/T192830 (10Urbanecm) Addition to nda LDAP group will probably be needed, in order to be able to 
access... [15:47:47] (03PS1) 10CRusnov: profile::netbox: stop using icinga as remote cron [puppet] - 10https://gerrit.wikimedia.org/r/512387 [15:48:24] (03CR) 10jerkins-bot: [V: 04-1] profile::netbox: stop using icinga as remote cron [puppet] - 10https://gerrit.wikimedia.org/r/512387 (owner: 10CRusnov) [15:52:25] (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/511686 (owner: 10Jbond) [15:54:20] (03Abandoned) 10CRusnov: profile::netbox: stop using icinga as remote cron [puppet] - 10https://gerrit.wikimedia.org/r/512387 (owner: 10CRusnov) [15:55:22] (03PS5) 10CRusnov: profile::netbox: stop using icinga as remote cron [puppet] - 10https://gerrit.wikimedia.org/r/509445 [15:57:44] (03PS11) 10CRusnov: Add LibreNMS parity check report. [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) [15:57:56] 10Operations, 10ops-eqiad: Storage problems with new host db1133 - https://phabricator.wikimedia.org/T222731 (10wiki_willy) @Marostegui - Chris is out on vacation this week, so I'll follow up with him when he's back on Tuesday. ~Willy [15:58:47] (03CR) 10CRusnov: Add LibreNMS parity check report. (031 comment) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) (owner: 10CRusnov) [16:02:41] 10Operations, 10ops-eqiad: Storage problems with new host db1133 - https://phabricator.wikimedia.org/T222731 (10Marostegui) Thank you! We still have 3 more hosts to keeps us busy with, but as this probably involves getting pieces replaced...it might take a sometime to get them delivered [16:14:01] 10Operations, 10Cloud-VPS, 10Traffic, 10cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (10Bstorm) I'm going to propose it here because the IRC conversation this morning was a bit frantic and some breakfix is happening as well: wh... 
[16:18:29] (03PS1) 10Arturo Borrero Gonzalez: Revert "Hiera backend: update the hiera configuration to remove the role backend" [puppet] - 10https://gerrit.wikimedia.org/r/512392 [16:24:06] 10Operations, 10Mail: User alias redirecting to another user alias - https://phabricator.wikimedia.org/T224254 (10HMarcus) @Volans ah, yes that looks like the root of the issue. Gsuite is configured to redirect `liaison` to `answers`. @Quiddity can you confirm if you would like the `liaison` alias to continue... [16:29:22] 10Operations, 10Mail: User alias redirecting to another user alias - https://phabricator.wikimedia.org/T224254 (10Quiddity) >>! In T224254#5210844, @HMarcus wrote: > @Quiddity can you confirm if you would like the `liaison` alias to continue to redirect to `answers`? If so, the mail team will need to remove th... [16:30:13] (03Abandoned) 10Arturo Borrero Gonzalez: Revert "Hiera backend: update the hiera configuration to remove the role backend" [puppet] - 10https://gerrit.wikimedia.org/r/512392 (owner: 10Arturo Borrero Gonzalez) [16:33:18] (03PS12) 10CRusnov: Add LibreNMS parity check report. [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) [16:34:01] 10Operations, 10Cloud-VPS, 10Traffic, 10cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (10Andrew) [16:34:11] (03PS13) 10CRusnov: Add LibreNMS parity check report. [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) [16:34:48] !log add routinator package to reprepro/APT - T220669 [16:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:53] T220669: RPKI Validation - https://phabricator.wikimedia.org/T220669 [16:35:08] (03PS14) 10CRusnov: Add LibreNMS parity check report. 
[software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) [16:35:45] 10Operations, 10Mail: User alias redirecting to another user alias - https://phabricator.wikimedia.org/T224254 (10HMarcus) Thanks for confirming. @Volans can you please remove `liaison` from your exim configuration of `legalquestions`? [16:37:14] 08̶W̶a̶r̶n̶i̶n̶g Device cr1-codfw.wikimedia.org recovered from Inbound interface errors [16:38:37] (03PS2) 10Volans: admin: add jfishback to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/512342 (https://phabricator.wikimedia.org/T222910) [16:38:39] (03CR) 10CRusnov: Add LibreNMS parity check report. (031 comment) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/510256 (https://phabricator.wikimedia.org/T221507) (owner: 10CRusnov) [16:39:51] (03CR) 10Volans: [C: 03+2] admin: add jfishback to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/512342 (https://phabricator.wikimedia.org/T222910) (owner: 10Volans) [16:46:43] 10Operations, 10Mail: User alias redirecting to another user alias - https://phabricator.wikimedia.org/T224254 (10Volans) @HMarcus change applied: ` -legalquestions: legal, liaison +legalquestions: legal ` Could you retry? If everything works as expected feel free to resolve the task. 
[16:51:52] (03PS2) 10Andrew Bogott: openstack puppetmaster profiles: don't include clientpackages [puppet] - 10https://gerrit.wikimedia.org/r/511875 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [16:53:11] (03PS2) 10Andrew Bogott: openstack puppetmaster roles: duplicate for set of profiles to be used in labs [puppet] - 10https://gerrit.wikimedia.org/r/511877 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [16:53:36] (03CR) 10Andrew Bogott: [C: 03+2] openstack puppetmaster profiles: don't include clientpackages [puppet] - 10https://gerrit.wikimedia.org/r/511875 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [16:54:44] (03CR) 10Andrew Bogott: [C: 03+2] openstack puppetmaster roles: duplicate for set of profiles to be used in labs [puppet] - 10https://gerrit.wikimedia.org/r/511877 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [16:55:47] 10Operations, 10SRE-Access-Requests, 10Security-Team, 10User-greg: Requesting access to deployment and analytics-privatedata-users for jfishback - https://phabricator.wikimedia.org/T222910 (10Volans) 05Open→03Resolved a:03Volans All changes merged, confirmed that James can connect to few hosts. As pe... [16:56:26] 10Operations, 10Mail: User alias redirecting to another user alias - https://phabricator.wikimedia.org/T224254 (10HMarcus) @Volans perfect, that resolved it. Thanks so much, will go ahead and close this. 
[16:56:49] 10Operations, 10Mail: User alias redirecting to another user alias - https://phabricator.wikimedia.org/T224254 (10HMarcus) 05Open→03Resolved [16:59:01] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [16:59:25] 10Operations, 10Cloud-VPS, 10Traffic, 10cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (10aborrero) Honestly, `wikimediacloudservices.org` seems overly long. Just reading that makes me feel lazy :-( If possible, I would use it o... [17:03:11] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [17:03:18] (03PS1) 10Volans: admin: add urbanecm to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/512401 (https://phabricator.wikimedia.org/T192830) [17:10:32] 10Operations, 10Mail: User alias redirecting to another user alias - https://phabricator.wikimedia.org/T224254 (10Dzahn) Thanks for removing the number of aliases on our side in general. Note i had just opened a ticket on Zendesk for OIT as well to add all the remaining aliases to legal@ so we can remove all o... [17:13:58] 10Operations, 10Cloud-VPS, 10Traffic, 10cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (10Krenair) >>! In T223902#5210949, @aborrero wrote: > ** we could use `$subdomain.wmcloud.org` if this subdomain is not hosted by designate (... 
[17:17:27] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:19:49] (03PS4) 10Dzahn: admins: add shell account and admin groups for iflorez [puppet] - 10https://gerrit.wikimedia.org/r/510985 (https://phabricator.wikimedia.org/T223496) [17:20:54] volans: ^ i added gfields@ as expiry contact as suggested. should we merge it? [17:21:08] approved enough by Nuria with her comment? [17:21:52] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for iflorez - https://phabricator.wikimedia.org/T223496 (10Dzahn) >>! In T223496#5209502, @Volans wrote: > @Dzahn I guess @georgina would be more appropriate, th... [17:28:40] 10Operations, 10serviceops, 10PHP 7.2 support, 10Patch-For-Review: switch webserver_misc_apps to PHP 7.2 (7.1) - https://phabricator.wikimedia.org/T224194 (10Dzahn) [17:30:40] (03PS1) 10Dzahn: admins: remove expired contractor account of juliaglen (merge on May 31) [puppet] - 10https://gerrit.wikimedia.org/r/512404 (https://phabricator.wikimedia.org/T214623) [17:38:39] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10Bstorm) I can say after a bit more research and experimentation that it would be good to be able to split between a data network that connect... [17:50:27] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to machines [stat1004, stat1005 (now stat1007), and stat1006] and groups for iflorez - https://phabricator.wikimedia.org/T223496 (10Volans) @Nuria all the pre-requisites are there, is this approved? 
[17:55:31] (03PS1) 10Ayounsi: Add rpki1001/2001 to DNS [dns] - 10https://gerrit.wikimedia.org/r/512405 (https://phabricator.wikimedia.org/T220669) [17:55:50] (03PS1) 10Bstorm: wikilabels: move wikilabels DB to its own server [puppet] - 10https://gerrit.wikimedia.org/r/512406 (https://phabricator.wikimedia.org/T224062) [18:03:32] (03CR) 10CRusnov: [C: 03+1] "lgtm!" [dns] - 10https://gerrit.wikimedia.org/r/512405 (https://phabricator.wikimedia.org/T220669) (owner: 10Ayounsi) [18:04:07] (03CR) 10Ayounsi: [C: 03+2] Add rpki1001/2001 to DNS [dns] - 10https://gerrit.wikimedia.org/r/512405 (https://phabricator.wikimedia.org/T220669) (owner: 10Ayounsi) [18:14:01] (03CR) 10Bstorm: [C: 03+2] wikilabels: move wikilabels DB to its own server [puppet] - 10https://gerrit.wikimedia.org/r/512406 (https://phabricator.wikimedia.org/T224062) (owner: 10Bstorm) [18:18:21] 10Operations, 10ops-codfw: Interface errors on cr1-codfw: xe-5/3/1 - https://phabricator.wikimedia.org/T222967 (10Papaul) CY1 did test the X-connect and didn't find any problem. see https://phabricator.wikimedia.org/T224196 Sending other follow up email to Telia [18:26:09] (03PS1) 10Ayounsi: Add cumin alias for rpki hosts [puppet] - 10https://gerrit.wikimedia.org/r/512411 (https://phabricator.wikimedia.org/T220669) [18:27:13] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. 
[18:30:38] (03PS1) 10Bstorm: Revert "wikilabels: move wikilabels DB to its own server" [puppet] - 10https://gerrit.wikimedia.org/r/512412 [18:31:22] (03CR) 10Bstorm: [C: 03+2] Revert "wikilabels: move wikilabels DB to its own server" [puppet] - 10https://gerrit.wikimedia.org/r/512412 (owner: 10Bstorm) [18:38:54] (03PS2) 10Ayounsi: Add cumin alias for rpki hosts [puppet] - 10https://gerrit.wikimedia.org/r/512411 (https://phabricator.wikimedia.org/T220669) [18:42:31] 10Operations, 10MediaWiki-Cache, 10serviceops, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 5 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10EvanProdromou) I'd assume there would be a lot of cou... [18:47:36] (03PS10) 10Ayounsi: Puppet, add RPKI validation daemon [puppet] - 10https://gerrit.wikimedia.org/r/508928 (https://phabricator.wikimedia.org/T220669) [18:47:38] (03PS3) 10Ayounsi: Prometheus, add Routinator endpoint [puppet] - 10https://gerrit.wikimedia.org/r/508956 (https://phabricator.wikimedia.org/T220669) [18:47:40] (03PS3) 10Ayounsi: Add cumin alias for rpki hosts [puppet] - 10https://gerrit.wikimedia.org/r/512411 (https://phabricator.wikimedia.org/T220669) [18:49:01] (03PS1) 10Bstorm: clouddns: fix incorrectly formatted string [puppet] - 10https://gerrit.wikimedia.org/r/512416 [18:59:33] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:00:03] (03CR) 10Alex Monk: [C: 03+1] clouddns: fix incorrectly formatted string [puppet] - 10https://gerrit.wikimedia.org/r/512416 (owner: 10Bstorm) [19:01:03] (03PS2) 10Bstorm: clouddns: fix incorrectly formatted string [puppet] - 10https://gerrit.wikimedia.org/r/512416 [19:02:30] (03PS1) 10Ayounsi: Move rpki2001 A to .codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/512417 [19:03:27] (03CR) 10Ayounsi: [C: 03+2] Move rpki2001 A to .codfw.wmnet [dns] - 
10https://gerrit.wikimedia.org/r/512417 (owner: 10Ayounsi) [19:03:38] (03CR) 10Bstorm: [C: 03+2] clouddns: fix incorrectly formatted string [puppet] - 10https://gerrit.wikimedia.org/r/512416 (owner: 10Bstorm) [19:07:48] 10Operations, 10Cloud-VPS, 10Traffic, 10cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (10BBlack) That cloud rebranding link above also mentions `wikimediacloud.org`, which is yet another option nobody's exploiting yet. So even... [19:09:52] (03PS1) 10Jforrester: Even more invariant config moved over to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512418 [19:10:07] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:11:23] (03CR) 10jerkins-bot: [V: 04-1] Even more invariant config moved over to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512418 (owner: 10Jforrester) [19:12:09] (03PS2) 10Jforrester: Even more invariant config moved over to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512418 [19:21:13] (03PS1) 10Ayounsi: Add rpki1001/2001 to DHCP [puppet] - 10https://gerrit.wikimedia.org/r/512419 (https://phabricator.wikimedia.org/T220669) [19:34:08] (03CR) 10Ayounsi: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/16759/" [puppet] - 10https://gerrit.wikimedia.org/r/512419 (https://phabricator.wikimedia.org/T220669) (owner: 10Ayounsi) [19:34:34] (03PS2) 10Ayounsi: Add rpki1001/2001 to DHCP [puppet] - 10https://gerrit.wikimedia.org/r/512419 (https://phabricator.wikimedia.org/T220669) [19:36:25] (03CR) 10Dzahn: [C: 04-2] "don't merge before May 31st" [puppet] - 10https://gerrit.wikimedia.org/r/512404 (https://phabricator.wikimedia.org/T214623) (owner: 10Dzahn) [19:40:54] 10Operations, 10ops-codfw: Interface errors on cr1-codfw: 
xe-5/3/1 - https://phabricator.wikimedia.org/T222967 (10Papaul) Dear Customer, Thank you for your email. We have started a case for your query. Telia Carrier case: 00984411. We will investigate this case and get back to you. We appreciate your pat... [19:45:47] (03PS1) 10Ayounsi: Add netboot.cfg config for rpki1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/512421 (https://phabricator.wikimedia.org/T220669) [19:46:49] (03CR) 10Ayounsi: [C: 03+2] Add netboot.cfg config for rpki1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/512421 (https://phabricator.wikimedia.org/T220669) (owner: 10Ayounsi) [19:47:19] (03PS1) 10Urbanecm: Add abusefilter-modify-restricted to abusefilter group on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512422 (https://phabricator.wikimedia.org/T224308) [19:49:15] (03PS1) 10Andrew Bogott: like this! [puppet] - 10https://gerrit.wikimedia.org/r/512423 [19:49:50] (03CR) 10jerkins-bot: [V: 04-1] like this! [puppet] - 10https://gerrit.wikimedia.org/r/512423 (owner: 10Andrew Bogott) [19:54:09] (03PS1) 10Urbanecm: Test spaces in wgMetaNamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512424 [19:54:58] (03CR) 10jerkins-bot: [V: 04-1] Test spaces in wgMetaNamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512424 (owner: 10Urbanecm) [19:56:25] 10Operations, 10SRE-Access-Requests: Requesting access to icinga for tonycepo - https://phabricator.wikimedia.org/T224313 (10Tonycepo) [19:57:04] (03PS1) 10Bstorm: wmcs: stop rewriting the zone variable for cnames [puppet] - 10https://gerrit.wikimedia.org/r/512425 (https://phabricator.wikimedia.org/T224000) [19:58:45] (03PS2) 10Urbanecm: Test spaces in wgMetaNamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512424 [19:58:56] (03CR) 10Andrew Bogott: [C: 03+1] "This will definitely help" [puppet] - 10https://gerrit.wikimedia.org/r/512425 (https://phabricator.wikimedia.org/T224000) (owner: 10Bstorm) [19:59:15] (03Abandoned) 10Andrew Bogott: like this! 
[puppet] - 10https://gerrit.wikimedia.org/r/512423 (owner: 10Andrew Bogott) [19:59:19] (03CR) 10Bstorm: [C: 03+2] wmcs: stop rewriting the zone variable for cnames [puppet] - 10https://gerrit.wikimedia.org/r/512425 (https://phabricator.wikimedia.org/T224000) (owner: 10Bstorm) [19:59:34] (03CR) 10jerkins-bot: [V: 04-1] Test spaces in wgMetaNamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512424 (owner: 10Urbanecm) [20:03:26] (03PS1) 10Urbanecm: Use underscores instead of spaces in wgMetaNamespace for several Urdu projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512426 (https://phabricator.wikimedia.org/T223039) [20:03:57] (03CR) 10jerkins-bot: [V: 04-1] Use underscores instead of spaces in wgMetaNamespace for several Urdu projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512426 (https://phabricator.wikimedia.org/T223039) (owner: 10Urbanecm) [20:06:22] (03PS11) 10Ayounsi: Puppet, add RPKI validation daemon [puppet] - 10https://gerrit.wikimedia.org/r/508928 (https://phabricator.wikimedia.org/T220669) [20:06:24] (03PS4) 10Ayounsi: Prometheus, add Routinator endpoint [puppet] - 10https://gerrit.wikimedia.org/r/508956 (https://phabricator.wikimedia.org/T220669) [20:06:26] (03PS4) 10Ayounsi: Add cumin alias for rpki hosts [puppet] - 10https://gerrit.wikimedia.org/r/512411 (https://phabricator.wikimedia.org/T220669) [20:06:37] (03PS2) 10Urbanecm: Use underscores instead of spaces in wgMetaNamespace for several Urdu projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512426 (https://phabricator.wikimedia.org/T223039) [20:07:32] (03CR) 10jerkins-bot: [V: 04-1] Use underscores instead of spaces in wgMetaNamespace for several Urdu projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512426 (https://phabricator.wikimedia.org/T223039) (owner: 10Urbanecm) [20:08:53] 10Operations, 10SRE-Access-Requests: Requesting access to icinga for tonycepo - https://phabricator.wikimedia.org/T224313 (10Aklapper) Hi @Tonycepo, 
thanks for taking the time to report this and welcome to Wikimedia Phabricator! Please see https://phabricator.wikimedia.org/project/profile/956/ and provide a... [20:10:13] (03PS3) 10Urbanecm: Use underscores instead of spaces in wgMetaNamespace for several Urdu projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512426 (https://phabricator.wikimedia.org/T223039) [20:11:51] (03PS3) 10Urbanecm: Test spaces in wgMetaNamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512424 [20:12:26] (03PS4) 10Urbanecm: Test spaces in wgMetaNamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512424 (https://phabricator.wikimedia.org/T223965) [20:12:58] (03PS1) 10Bstorm: wikilabels: change DNS to a new server [puppet] - 10https://gerrit.wikimedia.org/r/512428 (https://phabricator.wikimedia.org/T224062) [20:13:22] (03CR) 10jerkins-bot: [V: 04-1] Test spaces in wgMetaNamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512424 (https://phabricator.wikimedia.org/T223965) (owner: 10Urbanecm) [20:18:58] (03CR) 10Alex Monk: [C: 03+1] wikilabels: change DNS to a new server [puppet] - 10https://gerrit.wikimedia.org/r/512428 (https://phabricator.wikimedia.org/T224062) (owner: 10Bstorm) [20:19:03] (03PS4) 10Urbanecm: Use underscores instead of spaces in wgMetaNamespace for several Urdu projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512426 (https://phabricator.wikimedia.org/T223039) [20:19:25] (03PS5) 10Urbanecm: Test spaces in wgMetaNamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512424 (https://phabricator.wikimedia.org/T223965) [20:19:27] (03CR) 10Bstorm: [C: 03+2] wikilabels: change DNS to a new server [puppet] - 10https://gerrit.wikimedia.org/r/512428 (https://phabricator.wikimedia.org/T224062) (owner: 10Bstorm) [20:20:32] (03CR) 10jerkins-bot: [V: 04-1] Test spaces in wgMetaNamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512424 (https://phabricator.wikimedia.org/T223965) (owner: 10Urbanecm) 
[20:20:45] (03PS5) 10Urbanecm: Use underscores instead of spaces in wgMetaNamespace(Talk) for several projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512426 (https://phabricator.wikimedia.org/T223039) [20:21:56] (03PS6) 10Urbanecm: Test spaces in wgMetaNamespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512424 (https://phabricator.wikimedia.org/T223965) [20:23:01] (03PS7) 10Urbanecm: Test spaces in wgMetaNamespace(Talk) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512424 (https://phabricator.wikimedia.org/T223965) [20:23:03] (03CR) 10jerkins-bot: [V: 04-1] Test spaces in wgMetaNamespace(Talk) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512424 (https://phabricator.wikimedia.org/T223965) (owner: 10Urbanecm) [20:26:38] (03PS3) 10Urbanecm: Remove uploader user group from fawiki and merge it with autoconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505228 (https://phabricator.wikimedia.org/T221441) [20:33:37] PROBLEM - toolschecker: wikilabels read/write on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/db/wikilabelsrw - 241 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [20:33:55] well that's a new one [20:37:38] (03PS1) 10Urbanecm: Add HD logo for angwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512433 [20:37:47] RECOVERY - toolschecker: wikilabels read/write on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker [20:38:05] (03PS2) 10Urbanecm: Add HD logo for angwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512433 (https://phabricator.wikimedia.org/T150618) [20:38:53] (03CR) 10jerkins-bot: [V: 04-1] Add HD logo for angwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512433 
(https://phabricator.wikimedia.org/T150618) (owner: 10Urbanecm) [20:42:39] (03CR) 10DannyS712: [C: 03+1] Add abusefilter-modify-restricted to abusefilter group on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512422 (https://phabricator.wikimedia.org/T224308) (owner: 10Urbanecm) [20:51:05] 10Operations, 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): Ensure/confirm a way to shell into unpuppetized VMs - https://phabricator.wikimedia.org/T223920 (10Andrew) The good news is that once the firstboot script exits we should be able to get a local console. The bad news is that if the i... [20:58:36] (03PS3) 10Urbanecm: Add HD logo for angwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512433 (https://phabricator.wikimedia.org/T150618) [21:35:10] 10Operations, 10MediaWiki-Logging, 10Wikimedia-Logstash, 10wmerrors, and 7 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (10Legoktm) [21:41:38] (03CR) 10Krinkle: "From a quick glance, looks like this commit might be difficult (or impossible) to deploy correctly as neither IS, CS or mobile.php could b" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512418 (owner: 10Jforrester) [21:43:24] (03PS1) 10Andrew Bogott: cloud image firstboot: don't --waitforcert on first puppet run [puppet] - 10https://gerrit.wikimedia.org/r/512441 (https://phabricator.wikimedia.org/T223920) [21:43:57] (03CR) 10Jforrester: "> Patch Set 2:" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512418 (owner: 10Jforrester) [21:46:08] (03CR) 10Krinkle: Even more invariant config moved over to CommonSettings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512418 (owner: 10Jforrester) [21:48:18] (03CR) 10Andrew Bogott: [C: 03+2] cloud image firstboot: don't --waitforcert on first puppet run [puppet] - 10https://gerrit.wikimedia.org/r/512441 (https://phabricator.wikimedia.org/T223920) (owner: 10Andrew Bogott) 
[21:50:07] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Move the main WMCS puppetmaster into the Labs realm - https://phabricator.wikimedia.org/T171188 (10Andrew) [21:50:10] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Ensure/confirm a way to shell into unpuppetized VMs - https://phabricator.wikimedia.org/T223920 (10Andrew) 05Open→03Resolved With the attached patch in place, a new VM with no valid puppetmaster will flounde... [21:53:00] (03CR) 10Krinkle: Even more invariant config moved over to CommonSettings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512418 (owner: 10Jforrester) [21:53:13] (03CR) 10Krinkle: "OK. That sync order seems do-able. Thanks for checking." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512418 (owner: 10Jforrester) [21:53:17] 10Operations, 10Cassandra, 10RESTBase, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 (10Eevans) [21:56:26] !log decommissioning restbase1011-a -- T223976 [21:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:31] T223976: Decommission restbase10(0[7-9]|1[0-5]) - https://phabricator.wikimedia.org/T223976 [22:00:28] (03CR) 10Krinkle: Even more invariant config moved over to CommonSettings (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512418 (owner: 10Jforrester) [22:01:40] (03PS1) 10Mholloway: Remove obsolete Wikipedia Zero bits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512442 (https://phabricator.wikimedia.org/T187716) [22:04:19] (03CR) 10Mholloway: [C: 04-1] "Huh... turns out zero.wikimedia.org is still up. Maybe this is premature." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/512442 (https://phabricator.wikimedia.org/T187716) (owner: 10Mholloway) [22:04:23] (03CR) 10Krinkle: "I note that https://zero.wikimedia.org is still public. I don't know, but I suppose removing these bits might do something to it, which gi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512442 (https://phabricator.wikimedia.org/T187716) (owner: 10Mholloway) [22:05:22] (03Abandoned) 10Mholloway: Remove obsolete Wikipedia Zero bits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512442 (https://phabricator.wikimedia.org/T187716) (owner: 10Mholloway) [22:15:17] (03PS1) 10Dzahn: Revert "webserver_misc_apps: add PHP7.2 APT repository on stretch" [puppet] - 10https://gerrit.wikimedia.org/r/512445 [22:16:48] 10Operations, 10serviceops: upgrade krypton (webserver_misc_apps) to stretch - https://phabricator.wikimedia.org/T210008 (10Dzahn) [22:16:51] 10Operations, 10serviceops, 10PHP 7.2 support, 10Patch-For-Review: switch webserver_misc_apps to PHP 7.2 (7.1) - https://phabricator.wikimedia.org/T224194 (10Dzahn) 05Open→03Declined using PHP 7.2 was declined in T224247#5209664 This would just mean 7.0. Upgrading to stretch is already covered in T21... [22:17:44] 10Operations, 10serviceops: upgrade and rename krypton & create its codfw equivalent - https://phabricator.wikimedia.org/T224247 (10Dzahn) [22:32:25] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[22:32:54] 10Operations, 10Traffic, 10Wikidata, 10serviceops, and 4 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (10Dzahn) 05Open→03Stalled [22:34:01] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [22:38:22] 10Operations, 10serviceops, 10vm-requests: ganeti VM request - miscweb2001 - equivalent of krypton - https://phabricator.wikimedia.org/T224323 (10Dzahn) [22:39:35] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [22:41:12] (03PS1) 10Dzahn: rename krypton to miscweb1001 [dns] - 10https://gerrit.wikimedia.org/r/512446 (https://phabricator.wikimedia.org/T224247) [22:48:46] 10Operations, 10serviceops, 10Patch-For-Review: upgrade and rename krypton & create its codfw equivalent - https://phabricator.wikimedia.org/T224247 (10Dzahn) >>! In T224247#5209664, @MoritzMuehlenhoff wrote: > miscweb1001/2001? Sounds good! added new name: https://wikitech.wikimedia.org/w/index.php?title=... [22:53:36] 10Operations, 10Cloud-VPS, 10Traffic, 10cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (10Krenair) If we're going to divide things up in that manner it would strike me as a bit weird to have the full purposes of the different dom... [23:09:32] (03CR) 10Cwhite: [C: 03+1] "Applied this patch to beta and it appears to dtrt." 
[puppet] - 10https://gerrit.wikimedia.org/r/512193 (https://phabricator.wikimedia.org/T220103) (owner: 10Cwhite) [23:16:59] 10Operations, 10Cloud-VPS, 10Traffic, 10cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (10bd808) {T224324} is probably related in some way at least when we get to figuring out the DNS name to expose that LB setup to Cloud VPS use... [23:23:46] 10Operations, 10Cloud-VPS, 10Traffic, 10cloud-services-team (Kanban): cloudcontrol: decide on FQDN for service endpoints - https://phabricator.wikimedia.org/T223902 (10bd808) Reading the discussion here and in irc earlier today, I think the more general topic of which TLDs we are going to use for which pur... [23:25:49] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [23:32:49] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:57:25] 10Operations, 10SRE-Access-Requests: Requesting access to icinga for tonycepo - https://phabricator.wikimedia.org/T224313 (10Tonycepo) Hi @Aklapper , The reason for this request, it's because I want to know more about the alerting. Are you planning to migrate to another alerting software, like PROMETHEUS ALE...