[00:26:52] (PS3) Paladox: WIP: Planet: Redesgn UI [puppet] - https://gerrit.wikimedia.org/r/435327
[00:31:37] (PS4) Paladox: WIP: Planet: Redesgn UI [puppet] - https://gerrit.wikimedia.org/r/435327
[00:33:04] (PS5) Paladox: WIP: Planet: Redesgn UI [puppet] - https://gerrit.wikimedia.org/r/435327
[00:39:38] (PS6) Paladox: WIP: Planet: Redesgn UI [puppet] - https://gerrit.wikimedia.org/r/435327
[00:48:21] PROBLEM - HP RAID on labsdb1009 is CRITICAL: CRITICAL: Slot 1: Failed: 1I:1:9 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:10, 1I:1:11, 1I:1:12, 1I:1:13, 1I:1:14, 1I:1:15, 1I:1:16 - Controller: OK - Battery/Capacitor: OK
[00:48:22] ACKNOWLEDGEMENT - HP RAID on labsdb1009 is CRITICAL: CRITICAL: Slot 1: Failed: 1I:1:9 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:10, 1I:1:11, 1I:1:12, 1I:1:13, 1I:1:14, 1I:1:15, 1I:1:16 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T195690
[00:48:28] Operations, ops-eqiad: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4234323 (ops-monitoring-bot)
[01:12:10] (PS2) Legoktm: Add `webservice-python-bootstrap` command [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/435662 (https://phabricator.wikimedia.org/T174769)
[01:12:52] (CR) Legoktm: "PS2: Moved to base image based on IRC discussion that base is the image that the webservice shell command uses." [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/435662 (https://phabricator.wikimedia.org/T174769) (owner: Legoktm)
[01:13:21] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect, AS6939/IPv6: Active
[01:17:02] (PS3) Legoktm: Add `webservice-python-bootstrap` command [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/435662 (https://phabricator.wikimedia.org/T174769)
[01:17:20] (CR) Legoktm: "...back to PS1 now :)" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/435662 (https://phabricator.wikimedia.org/T174769) (owner: Legoktm)
[01:17:42] RECOVERY - BGP status on cr2-ulsfo is OK: BGP OK - up: 98, down: 0, shutdown: 4
[01:23:12] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active, AS6939/IPv4: Connect
[01:25:12] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 31 probes of 304 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[01:29:32] PROBLEM - Device not healthy -SMART- on labsdb1009 is CRITICAL: cluster=mysql device=cciss,15 instance=labsdb1009:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=labsdb1009&var-datasource=eqiad%2520prometheus%252Fops
[01:30:21] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 8 probes of 304 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[01:56:02] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet operation_type={container_status,create_container,image_status,podsandbox_status,remove_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:57:11] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[02:56:53] (Abandoned) Chad: WIP: Adding a "Deployed to" bit for the "Included In" header [software/gerrit/plugins/wikimedia] - https://gerrit.wikimedia.org/r/414607 (owner: Chad)
[02:57:59] Operations, DNS, Release-Engineering-Team, Traffic, and 2 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776#4234362 (demon) a:demon>None Actually, I won't be able to get this done. Unlicking the cookie.
[03:03:11] (PS1) Alex Monk: puppetdb: Don't try to install PostgreSQL tuning.conf until PostgreSQL directories exist [puppet] - https://gerrit.wikimedia.org/r/435677
[03:03:39] (CR) Alex Monk: "Full error:" [puppet] - https://gerrit.wikimedia.org/r/435677 (owner: Alex Monk)
[03:03:54] (CR) jerkins-bot: [V: -1] puppetdb: Don't try to install PostgreSQL tuning.conf until PostgreSQL directories exist [puppet] - https://gerrit.wikimedia.org/r/435677 (owner: Alex Monk)
[03:29:35] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 753.51 seconds
[03:33:25] PROBLEM - puppet last run on analytics1064 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz],File[/usr/share/GeoIP/GeoIP2-City.mmdb.test]
[03:37:16] PROBLEM - puppet last run on mw2277 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-ISP.mmdb.gz]
[04:02:36] RECOVERY - puppet last run on mw2277 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:03:45] RECOVERY - puppet last run on analytics1064 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:43:05] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 162.98 seconds
[04:46:25] PROBLEM - Host mw1280 is DOWN: PING CRITICAL - Packet loss = 100%
[05:48:45] PROBLEM - MariaDB Slave Lag: s8 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 87330.95 seconds
[06:31:45] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/60-update-ocsp-all.conf]
[06:57:05] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:41:18] Operations, ops-eqiad, DBA: Degraded RAID on labsdb1009 - https://phabricator.wikimedia.org/T195690#4234397 (Marostegui) p:Triage>Normal a:Cmjohnson @Cmjohnson can we get a new disk ordered? This host should be under warranty Thanks!
[08:33:09] Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4234417 (Marostegui) I have compared a few tables between all the sections on the current and future definitive sanitarium ho...
[08:36:53] Operations, Beta-Cluster-Infrastructure, Release-Engineering-Team: Upgrade deployment-prep deployment servers to stretch - https://phabricator.wikimedia.org/T192561#4234419 (EddieGP) It seems you used the same flavor for deploy1001 that tin had. This would've been a great time to switch to a differen...
[09:47:36] RECOVERY - MegaRAID on db1054 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[10:07:06] Operations, Puppet, Beta-Cluster-Infrastructure: Make deployment-prep puppetmaster more similar to Production puppetmaster - https://phabricator.wikimedia.org/T146627#4234462 (EddieGP) Open>Resolved a:EddieGP According to openstack browser this uses ::standalone as of now. Seems the ony t...
[10:17:46] PROBLEM - MegaRAID on db1054 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough
[10:39:45] RECOVERY - BGP status on cr2-ulsfo is OK: BGP OK - up: 98, down: 0, shutdown: 4
[13:44:42] (PS1) Nehajha: read command line arguments from a config file [software/tools-webservice] - https://gerrit.wikimedia.org/r/435691
[13:50:55] (PS1) Nehajha: print the type of webservice running [software/tools-webservice] - https://gerrit.wikimedia.org/r/435692
[13:57:02] any ops around who could 'apt-cache policy puppetdb' on a prod puppetdb host?
[14:05:44] Krenair: done in pvt
[14:07:08] alright I see what I'm missing now, thanks
[14:14:40] yes this is working much better now, thanks
[14:15:28] turns out my puppetdb host was missing puppetdb_major_version from hiera which was causing it to miss the wikimedia-puppetdb4 apt::repository which was causing it to have an old incompatible version of puppetdb
[14:19:05] (PS2) Alex Monk: puppetdb: Don't try to install tuning.conf until dir/package exists [puppet] - https://gerrit.wikimedia.org/r/435677
[14:33:42] (PS1) Urbanecm: Create 2 extra namespaces for bdwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/435693 (https://phabricator.wikimedia.org/T195700)
[14:44:28] (PS1) Urbanecm: Add 2 namespace aliases to bdwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/435694 (https://phabricator.wikimedia.org/T195700)
[15:25:25] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet operation_type={container_status,create_container,podsandbox_status,remove_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[15:26:26] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[15:29:46] anyone used puppet-ecdsacert lately?
[15:35:00] looks like it doesn't quite work well in labs
[15:35:06] without a config
[15:37:11] unless you're using the central puppetmaster I guess
[15:39:23] due to args[:puppetca] defaulting to 'puppet' which, well: puppet. 37 IN A 208.80.154.158
[15:39:38] 158.154.80.208.in-addr.arpa. 1723 IN PTR labpuppetmaster1001.wikimedia.org.
[15:39:56] can't set up puppet.deployment-prep.eqiad.wmflabs etc. without novaadmin
[15:46:19] alright documented at https://wikitech.wikimedia.org/wiki/Puppet-ecdsacert
[15:55:36] (PS1) Alex Monk: puppet-ecdsacert: verify connection to puppetmaster [puppet] - https://gerrit.wikimedia.org/r/435697
[16:18:36] RECOVERY - MariaDB Slave Lag: s8 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 33.20 seconds
[19:04:12] Operations, Gerrit, Patch-For-Review, Release-Engineering-Team (Someday): Gerrit shows HTTP 500 error when pasting extended unicode characters - https://phabricator.wikimedia.org/T145885#4235014 (Paladox) The gerrit upgrade will be happening soon so as soon as the upgrade is done notedb will quic...
[19:59:48] (PS1) Aklapper: phabricator: Make account names link to their Phab profiles [puppet] - https://gerrit.wikimedia.org/r/435713
[20:28:42] (PS10) Alex Monk: swift: use implicit /dev/swift prefix for swift devices [puppet] - https://gerrit.wikimedia.org/r/361648 (https://phabricator.wikimedia.org/T163673) (owner: Filippo Giunchedi)
[20:32:14] (PS2) Alex Monk: swift: Fix checks on drive/filesystem titles to allow for labs ones [puppet] - https://gerrit.wikimedia.org/r/402758 (https://phabricator.wikimedia.org/T184236)
[20:40:02] Operations, Puppet, Beta-Cluster-Infrastructure, media-storage, Patch-For-Review: Puppet broken on deployment-ms-be0[34] with evaluation error in swift module - https://phabricator.wikimedia.org/T184236#4235037 (Krenair) Rebased the patches and re-cherry-picked
[20:55:11] Operations, Beta-Cluster-Infrastructure, Patch-For-Review, Prometheus-metrics-monitoring, User-fgiunchedi: Move deployment-prep redis instances to stretch - https://phabricator.wikimedia.org/T179371#4235039 (Krenair) @fgiunchedi: So we basically need to find someone to review the puppet patch...
[21:21:44] (PS1) Alex Monk: Replace etcd cert after puppetmaster change [puppet] - https://gerrit.wikimedia.org/r/435715 (https://phabricator.wikimedia.org/T195686)
[22:48:31] (PS3) Alex Monk: Fix mwrepl to require expanddblist dependency, from scap::scripts [puppet] - https://gerrit.wikimedia.org/r/372764
[22:49:04] (CR) jerkins-bot: [V: -1] Fix mwrepl to require expanddblist dependency, from scap::scripts [puppet] - https://gerrit.wikimedia.org/r/372764 (owner: Alex Monk)
[22:50:48] (PS4) Alex Monk: Allow use of PuppetDB in labs for ssh_known_hosts [puppet] - https://gerrit.wikimedia.org/r/333471 (https://phabricator.wikimedia.org/T72792)
[22:52:17] (CR) Alex Monk: "Jenkins is complaining about including other modules now?" [puppet] - https://gerrit.wikimedia.org/r/372764 (owner: Alex Monk)
[22:56:57] (PS2) Alex Monk: deployment-prep: Replace etcd cert after puppetmaster change [puppet] - https://gerrit.wikimedia.org/r/435715 (https://phabricator.wikimedia.org/T195686)
[23:15:12] (CR) Alex Monk: "(cherry picked, of course)" [puppet] - https://gerrit.wikimedia.org/r/435715 (https://phabricator.wikimedia.org/T195686) (owner: Alex Monk)
[23:15:25] (CR) Alex Monk: "(cherry picked)" [puppet] - https://gerrit.wikimedia.org/r/435631 (https://phabricator.wikimedia.org/T194962) (owner: Alex Monk)
[23:15:41] (CR) Alex Monk: "(cherry picked)" [puppet] - https://gerrit.wikimedia.org/r/435670 (owner: Alex Monk)